Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.
***** Acces problem to record *****\

Identifieur interne : 000337 ( Pmc/Corpus ); précédent : 0003369; suivant : 0003380 ***** probable Xml problem with record *****

Links to Exploration step


Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Next generation sequencing reads comparison with an alignment-free distance</title>
<author>
<name sortKey="Weitschek, Emanuel" sort="Weitschek, Emanuel" uniqKey="Weitschek E" first="Emanuel" last="Weitschek">Emanuel Weitschek</name>
<affiliation>
<nlm:aff id="Aff9">Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Santoni, Daniele" sort="Santoni, Daniele" uniqKey="Santoni D" first="Daniele" last="Santoni">Daniele Santoni</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fiscon, Giulia" sort="Fiscon, Giulia" uniqKey="Fiscon G" first="Giulia" last="Fiscon">Giulia Fiscon</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff11">Department of Computer, Control, and Management Engineering “Antonio Ruberti”, Viale Ariosto 25, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="De Cola, Maria Cristina" sort="De Cola, Maria Cristina" uniqKey="De Cola M" first="Maria Cristina" last="De Cola">Maria Cristina De Cola</name>
<affiliation>
<nlm:aff id="Aff12">IRCCS Centro Neurolesi “Bonino-Pulejo”, S.S.113 Via Palermo C/da Casazza, 98123 Messina, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bertolazzi, Paola" sort="Bertolazzi, Paola" uniqKey="Bertolazzi P" first="Paola" last="Bertolazzi">Paola Bertolazzi</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Felici, Giovanni" sort="Felici, Giovanni" uniqKey="Felici G" first="Giovanni" last="Felici">Giovanni Felici</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25465386</idno>
<idno type="pmc">4265526</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4265526</idno>
<idno type="RBID">PMC:4265526</idno>
<idno type="doi">10.1186/1756-0500-7-869</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000337</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000337</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Next generation sequencing reads comparison with an alignment-free distance</title>
<author>
<name sortKey="Weitschek, Emanuel" sort="Weitschek, Emanuel" uniqKey="Weitschek E" first="Emanuel" last="Weitschek">Emanuel Weitschek</name>
<affiliation>
<nlm:aff id="Aff9">Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Santoni, Daniele" sort="Santoni, Daniele" uniqKey="Santoni D" first="Daniele" last="Santoni">Daniele Santoni</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fiscon, Giulia" sort="Fiscon, Giulia" uniqKey="Fiscon G" first="Giulia" last="Fiscon">Giulia Fiscon</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff11">Department of Computer, Control, and Management Engineering “Antonio Ruberti”, Viale Ariosto 25, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="De Cola, Maria Cristina" sort="De Cola, Maria Cristina" uniqKey="De Cola M" first="Maria Cristina" last="De Cola">Maria Cristina De Cola</name>
<affiliation>
<nlm:aff id="Aff12">IRCCS Centro Neurolesi “Bonino-Pulejo”, S.S.113 Via Palermo C/da Casazza, 98123 Messina, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Bertolazzi, Paola" sort="Bertolazzi, Paola" uniqKey="Bertolazzi P" first="Paola" last="Bertolazzi">Paola Bertolazzi</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Felici, Giovanni" sort="Felici, Giovanni" uniqKey="Felici G" first="Giovanni" last="Felici">Giovanni Felici</name>
<affiliation>
<nlm:aff id="Aff10">Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Research Notes</title>
<idno type="eISSN">1756-0500</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (
<italic>reads</italic>
). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, mutation analysis.</p>
</sec>
<sec>
<title>Methods</title>
<p>We propose a method to evaluate the similarity between reads. This method does not rely on the alignment of the reads and it is based on the distance between the frequencies of their substrings of fixed dimensions (
<italic>k</italic>
-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and Blast. The comparison is based on a simple assumption: the most correct distance is obtained by knowing in advance the reference sequence. Therefore, we first align the reads on the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how the alignment-free and the alignment-based distances reproduce this ideal distance. The ability of correctly reproducing the ideal distance is evaluated over samples of read pairs from
<italic>Saccharomyces cerevisiae</italic>
,
<italic>Escherichia coli</italic>
, and
<italic>Homo sapiens</italic>
. The comparison is based on the correctness of threshold predictors cross-validated over different samples.</p>
</sec>
<sec>
<title>Results</title>
<p>We exhibit experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time consuming distances based on alignment.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Alignment-free distances may be used effectively for reads comparison, and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, reads classification).</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/1756-0500-7-869) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Eisenstein, M" uniqKey="Eisenstein M">M Eisenstein</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, L" uniqKey="Liu L">L Liu</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Li, S" uniqKey="Li S">S Li</name>
</author>
<author>
<name sortKey="Hu, N" uniqKey="Hu N">N Hu</name>
</author>
<author>
<name sortKey="He, Y" uniqKey="He Y">Y He</name>
</author>
<author>
<name sortKey="Pong, R" uniqKey="Pong R">R Pong</name>
</author>
<author>
<name sortKey="Lin, D" uniqKey="Lin D">D Lin</name>
</author>
<author>
<name sortKey="Lu, L" uniqKey="Lu L">L Lu</name>
</author>
<author>
<name sortKey="Law, M" uniqKey="Law M">M Law</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Metzker, Ml" uniqKey="Metzker M">ML Metzker</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Earl, D" uniqKey="Earl D">D Earl</name>
</author>
<author>
<name sortKey="Bradnam, K" uniqKey="Bradnam K">K Bradnam</name>
</author>
<author>
<name sortKey="John, Js" uniqKey="John J">JS John</name>
</author>
<author>
<name sortKey="Darling, A" uniqKey="Darling A">A Darling</name>
</author>
<author>
<name sortKey="Lin, D" uniqKey="Lin D">D Lin</name>
</author>
<author>
<name sortKey="Fass, J" uniqKey="Fass J">J Fass</name>
</author>
<author>
<name sortKey="Yu, Hok" uniqKey="Yu H">HOK Yu</name>
</author>
<author>
<name sortKey="Buffalo, V" uniqKey="Buffalo V">V Buffalo</name>
</author>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Diekhans, M" uniqKey="Diekhans M">M Diekhans</name>
</author>
<author>
<name sortKey="Ariyaratne, Pn" uniqKey="Ariyaratne P">PN Ariyaratne</name>
</author>
<author>
<name sortKey="Sung, W K" uniqKey="Sung W">W-K Sung</name>
</author>
<author>
<name sortKey="Ning, Z" uniqKey="Ning Z">Z Ning</name>
</author>
<author>
<name sortKey="Haimel, M" uniqKey="Haimel M">M Haimel</name>
</author>
<author>
<name sortKey="Simpson, Jt" uniqKey="Simpson J">JT Simpson</name>
</author>
<author>
<name sortKey="Fonseca, Na" uniqKey="Fonseca N">NA Fonseca</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Docking, Tr" uniqKey="Docking T">TR Docking</name>
</author>
<author>
<name sortKey="Ho, Iy" uniqKey="Ho I">IY Ho</name>
</author>
<author>
<name sortKey="Rokhsar, Ds" uniqKey="Rokhsar D">DS Rokhsar</name>
</author>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Lavenier, D" uniqKey="Lavenier D">D Lavenier</name>
</author>
<author>
<name sortKey="Chapuis, G" uniqKey="Chapuis G">G Chapuis</name>
</author>
<author>
<name sortKey="Naquin, D" uniqKey="Naquin D">D Naquin</name>
</author>
<author>
<name sortKey="Maillet, N" uniqKey="Maillet N">N Maillet</name>
</author>
<author>
<name sortKey="Schatz, Mc" uniqKey="Schatz M">MC Schatz</name>
</author>
<author>
<name sortKey="Kelley, Dr" uniqKey="Kelley D">DR Kelley</name>
</author>
<author>
<name sortKey="Phillippy, Am" uniqKey="Phillippy A">AM Phillippy</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Nguyen, N" uniqKey="Nguyen N">N Nguyen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bradnam, Kr" uniqKey="Bradnam K">KR Bradnam</name>
</author>
<author>
<name sortKey="Fass, Jn" uniqKey="Fass J">JN Fass</name>
</author>
<author>
<name sortKey="Alexandrov, A" uniqKey="Alexandrov A">A Alexandrov</name>
</author>
<author>
<name sortKey="Baranay, P" uniqKey="Baranay P">P Baranay</name>
</author>
<author>
<name sortKey="Bechner, M" uniqKey="Bechner M">M Bechner</name>
</author>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Boisvert, S" uniqKey="Boisvert S">S Boisvert</name>
</author>
<author>
<name sortKey="Chapman, Ja" uniqKey="Chapman J">JA Chapman</name>
</author>
<author>
<name sortKey="Chapuis, G" uniqKey="Chapuis G">G Chapuis</name>
</author>
<author>
<name sortKey="Chikhi, R" uniqKey="Chikhi R">R Chikhi</name>
</author>
<author>
<name sortKey="Chitsaz, H" uniqKey="Chitsaz H">H Chitsaz</name>
</author>
<author>
<name sortKey="Chou, W C" uniqKey="Chou W">W-C Chou</name>
</author>
<author>
<name sortKey="Corbeil, J" uniqKey="Corbeil J">J Corbeil</name>
</author>
<author>
<name sortKey="Fabbro, Cd" uniqKey="Fabbro C">CD Fabbro</name>
</author>
<author>
<name sortKey="Docking, Tr" uniqKey="Docking T">TR Docking</name>
</author>
<author>
<name sortKey="Durbin, R" uniqKey="Durbin R">R Durbin</name>
</author>
<author>
<name sortKey="Earl, D" uniqKey="Earl D">D Earl</name>
</author>
<author>
<name sortKey="Emrich, S" uniqKey="Emrich S">S Emrich</name>
</author>
<author>
<name sortKey="Fedotov, P" uniqKey="Fedotov P">P Fedotov</name>
</author>
<author>
<name sortKey="Fonseca, Na" uniqKey="Fonseca N">NA Fonseca</name>
</author>
<author>
<name sortKey="Ganapathy, G" uniqKey="Ganapathy G">G Ganapathy</name>
</author>
<author>
<name sortKey="Gibbs, Ra" uniqKey="Gibbs R">RA Gibbs</name>
</author>
<author>
<name sortKey="Gnerre, S" uniqKey="Gnerre S">S Gnerre</name>
</author>
<author>
<name sortKey="Godzaridis, E" uniqKey="Godzaridis E">E Godzaridis</name>
</author>
<author>
<name sortKey="Goldstein, S" uniqKey="Goldstein S">S Goldstein</name>
</author>
<author>
<name sortKey="Haimel, M" uniqKey="Haimel M">M Haimel</name>
</author>
<author>
<name sortKey="Hall, G" uniqKey="Hall G">G Hall</name>
</author>
<author>
<name sortKey="Haussler, D" uniqKey="Haussler D">D Haussler</name>
</author>
<author>
<name sortKey="Hiatt, Jb" uniqKey="Hiatt J">JB Hiatt</name>
</author>
<author>
<name sortKey="Ho, Iy" uniqKey="Ho I">IY Ho</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nagarajan, N" uniqKey="Nagarajan N">N Nagarajan</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blazewicz, J" uniqKey="Blazewicz J">J Blazewicz</name>
</author>
<author>
<name sortKey="Bryja, M" uniqKey="Bryja M">M Bryja</name>
</author>
<author>
<name sortKey="Figlerowicz, M" uniqKey="Figlerowicz M">M Figlerowicz</name>
</author>
<author>
<name sortKey="Gawron, P" uniqKey="Gawron P">P Gawron</name>
</author>
<author>
<name sortKey="Kasprzak, M" uniqKey="Kasprzak M">M Kasprzak</name>
</author>
<author>
<name sortKey="Kirton, E" uniqKey="Kirton E">E Kirton</name>
</author>
<author>
<name sortKey="Platt, D" uniqKey="Platt D">D Platt</name>
</author>
<author>
<name sortKey="Przybytek, J" uniqKey="Przybytek J">J Przybytek</name>
</author>
<author>
<name sortKey="Swiercz, A" uniqKey="Swiercz A">A Swiercz</name>
</author>
<author>
<name sortKey="Szajkowski, L" uniqKey="Szajkowski L">L Szajkowski</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Compeau, Pec" uniqKey="Compeau P">PEC Compeau</name>
</author>
<author>
<name sortKey="Pevzner, Pa" uniqKey="Pevzner P">PA Pevzner</name>
</author>
<author>
<name sortKey="Tesler, G" uniqKey="Tesler G">G Tesler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Birol, I" uniqKey="Birol I">I Birol</name>
</author>
<author>
<name sortKey="Jackman, Sd" uniqKey="Jackman S">SD Jackman</name>
</author>
<author>
<name sortKey="Nielsen, Cb" uniqKey="Nielsen C">CB Nielsen</name>
</author>
<author>
<name sortKey="Qian, Jq" uniqKey="Qian J">JQ Qian</name>
</author>
<author>
<name sortKey="Varhol, R" uniqKey="Varhol R">R Varhol</name>
</author>
<author>
<name sortKey="Stazyk, G" uniqKey="Stazyk G">G Stazyk</name>
</author>
<author>
<name sortKey="Morin, Rd" uniqKey="Morin R">RD Morin</name>
</author>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y Zhao</name>
</author>
<author>
<name sortKey="Hirst, M" uniqKey="Hirst M">M Hirst</name>
</author>
<author>
<name sortKey="Schein, Je" uniqKey="Schein J">JE Schein</name>
</author>
<author>
<name sortKey="Horsman, De" uniqKey="Horsman D">DE Horsman</name>
</author>
<author>
<name sortKey="Connors, Jm" uniqKey="Connors J">JM Connors</name>
</author>
<author>
<name sortKey="Gascoyne, Rd" uniqKey="Gascoyne R">RD Gascoyne</name>
</author>
<author>
<name sortKey="Marra, Ma" uniqKey="Marra M">MA Marra</name>
</author>
<author>
<name sortKey="Jones, Sj" uniqKey="Jones S">SJ Jones</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zerbino, Dr" uniqKey="Zerbino D">DR Zerbino</name>
</author>
<author>
<name sortKey="Birney, E" uniqKey="Birney E">E Birney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Luo, R" uniqKey="Luo R">R Luo</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Xie, Y" uniqKey="Xie Y">Y Xie</name>
</author>
<author>
<name sortKey="Li, Z" uniqKey="Li Z">Z Li</name>
</author>
<author>
<name sortKey="Huang, W" uniqKey="Huang W">W Huang</name>
</author>
<author>
<name sortKey="Yuan, J" uniqKey="Yuan J">J Yuan</name>
</author>
<author>
<name sortKey="He, G" uniqKey="He G">G He</name>
</author>
<author>
<name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
<author>
<name sortKey="Pan, Q" uniqKey="Pan Q">Q Pan</name>
</author>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Tang, J" uniqKey="Tang J">J Tang</name>
</author>
<author>
<name sortKey="Wu, G" uniqKey="Wu G">G Wu</name>
</author>
<author>
<name sortKey="Zhang, H" uniqKey="Zhang H">H Zhang</name>
</author>
<author>
<name sortKey="Shi, Y" uniqKey="Shi Y">Y Shi</name>
</author>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Yu, C" uniqKey="Yu C">C Yu</name>
</author>
<author>
<name sortKey="Wang, B" uniqKey="Wang B">B Wang</name>
</author>
<author>
<name sortKey="Lu, Y" uniqKey="Lu Y">Y Lu</name>
</author>
<author>
<name sortKey="Han, C" uniqKey="Han C">C Han</name>
</author>
<author>
<name sortKey="Cheung, Dw" uniqKey="Cheung D">DW Cheung</name>
</author>
<author>
<name sortKey="Yiu, S M" uniqKey="Yiu S">S-M Yiu</name>
</author>
<author>
<name sortKey="Peng, S" uniqKey="Peng S">S Peng</name>
</author>
<author>
<name sortKey="Xiaoqian, Z" uniqKey="Xiaoqian Z">Z Xiaoqian</name>
</author>
<author>
<name sortKey="Liu, G" uniqKey="Liu G">G Liu</name>
</author>
<author>
<name sortKey="Liao, X" uniqKey="Liao X">X Liao</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
<author>
<name sortKey="Yang, H" uniqKey="Yang H">H Yang</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
<author>
<name sortKey="Lam, T W" uniqKey="Lam T">T-W Lam</name>
</author>
<author>
<name sortKey="Wang, J" uniqKey="Wang J">J Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Miller, Jr" uniqKey="Miller J">JR Miller</name>
</author>
<author>
<name sortKey="Koren, S" uniqKey="Koren S">S Koren</name>
</author>
<author>
<name sortKey="Sutton, G" uniqKey="Sutton G">G Sutton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
<author>
<name sortKey="Almeida, J" uniqKey="Almeida J">J Almeida</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Polychronopoulos, D" uniqKey="Polychronopoulos D">D Polychronopoulos</name>
</author>
<author>
<name sortKey="Weitschek, E" uniqKey="Weitschek E">E Weitschek</name>
</author>
<author>
<name sortKey="Dimitrieva, S" uniqKey="Dimitrieva S">S Dimitrieva</name>
</author>
<author>
<name sortKey="Bucher, P" uniqKey="Bucher P">P Bucher</name>
</author>
<author>
<name sortKey="Felici, G" uniqKey="Felici G">G Felici</name>
</author>
<author>
<name sortKey="Almirantis, Y" uniqKey="Almirantis Y">Y Almirantis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Vitnyi, Pmb" uniqKey="Vitnyi P">PMB Vitnyi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Almeida, Js" uniqKey="Almeida J">JS Almeida</name>
</author>
<author>
<name sortKey="Vinga, S" uniqKey="Vinga S">S Vinga</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Giancarlo, R" uniqKey="Giancarlo R">R Giancarlo</name>
</author>
<author>
<name sortKey="Scaturro, D" uniqKey="Scaturro D">D Scaturro</name>
</author>
<author>
<name sortKey="Utro, F" uniqKey="Utro F">F Utro</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuksa, P" uniqKey="Kuksa P">P Kuksa</name>
</author>
<author>
<name sortKey="Pavlovic, V" uniqKey="Pavlovic V">V Pavlovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hide, W" uniqKey="Hide W">W Hide</name>
</author>
<author>
<name sortKey="Burke, J" uniqKey="Burke J">J Burke</name>
</author>
<author>
<name sortKey="Da Vison, Db" uniqKey="Da Vison D">DB Da Vison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teeling, H" uniqKey="Teeling H">H Teeling</name>
</author>
<author>
<name sortKey="Meyerdiekers, A" uniqKey="Meyerdiekers A">A Meyerdiekers</name>
</author>
<author>
<name sortKey="Bauer, M" uniqKey="Bauer M">M Bauer</name>
</author>
<author>
<name sortKey="Glockner, Fo" uniqKey="Glockner F">FO Glöckner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pride, Dt" uniqKey="Pride D">DT Pride</name>
</author>
<author>
<name sortKey="Meinersmann, Rj" uniqKey="Meinersmann R">RJ Meinersmann</name>
</author>
<author>
<name sortKey="Wassenaar, Tm" uniqKey="Wassenaar T">TM Wassenaar</name>
</author>
<author>
<name sortKey="Blaser, Mj" uniqKey="Blaser M">MJ Blaser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teeling, H" uniqKey="Teeling H">H Teeling</name>
</author>
<author>
<name sortKey="Waldmann, J" uniqKey="Waldmann J">J Waldmann</name>
</author>
<author>
<name sortKey="Lombardot, T" uniqKey="Lombardot T">T Lombardot</name>
</author>
<author>
<name sortKey="Bauer, M" uniqKey="Bauer M">M Bauer</name>
</author>
<author>
<name sortKey="Glockner, Fo" uniqKey="Glockner F">FO Glöckner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Trapnell, C" uniqKey="Trapnell C">C Trapnell</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langmead, B" uniqKey="Langmead B">B Langmead</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Needleman, Sb" uniqKey="Needleman S">SB Needleman</name>
</author>
<author>
<name sortKey="Wunsch, Cd" uniqKey="Wunsch C">CD Wunsch</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Altschul, S" uniqKey="Altschul S">S Altschul</name>
</author>
<author>
<name sortKey="Gish, W" uniqKey="Gish W">W Gish</name>
</author>
<author>
<name sortKey="Miller, W" uniqKey="Miller W">W Miller</name>
</author>
<author>
<name sortKey="Myers, E" uniqKey="Myers E">E Myers</name>
</author>
<author>
<name sortKey="Lipman, D" uniqKey="Lipman D">D Lipman</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fawcett, T" uniqKey="Fawcett T">T Fawcett</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, K" uniqKey="Song K">K Song</name>
</author>
<author>
<name sortKey="Ren, J" uniqKey="Ren J">J Ren</name>
</author>
<author>
<name sortKey="Zhai, Z" uniqKey="Zhai Z">Z Zhai</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author>
<name sortKey="Deng, M" uniqKey="Deng M">M Deng</name>
</author>
<author>
<name sortKey="Sun, F" uniqKey="Sun F">F Sun</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Res Notes</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Res Notes</journal-id>
<journal-title-group>
<journal-title>BMC Research Notes</journal-title>
</journal-title-group>
<issn pub-type="epub">1756-0500</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25465386</article-id>
<article-id pub-id-type="pmc">4265526</article-id>
<article-id pub-id-type="publisher-id">3381</article-id>
<article-id pub-id-type="doi">10.1186/1756-0500-7-869</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Next generation sequencing reads comparison with an alignment-free distance</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Weitschek</surname>
<given-names>Emanuel</given-names>
</name>
<address>
<email>emanuel@dia.uniroma3.it</email>
</address>
<xref ref-type="aff" rid="Aff9"></xref>
<xref ref-type="aff" rid="Aff10"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Santoni</surname>
<given-names>Daniele</given-names>
</name>
<address>
<email>daniele.santoni@iasi.cnr.it</email>
</address>
<xref ref-type="aff" rid="Aff10"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Fiscon</surname>
<given-names>Giulia</given-names>
</name>
<address>
<email>fiscon@dis.uniroma1.it</email>
</address>
<xref ref-type="aff" rid="Aff10"></xref>
<xref ref-type="aff" rid="Aff11"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>De Cola</surname>
<given-names>Maria Cristina</given-names>
</name>
<address>
<email>cristina.decola@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff12"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bertolazzi</surname>
<given-names>Paola</given-names>
</name>
<address>
<email>paola.bertolazzi@iasi.cnr.it</email>
</address>
<xref ref-type="aff" rid="Aff10"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Felici</surname>
<given-names>Giovanni</given-names>
</name>
<address>
<email>giovanni.felici@iasi.cnr.it</email>
</address>
<xref ref-type="aff" rid="Aff10"></xref>
</contrib>
<aff id="Aff9">
<label></label>
Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy</aff>
<aff id="Aff10">
<label></label>
Institute of Systems Analysis and Computer Science “A. Ruberti”, National Research Council, Via dei Taurini 19, 00185 Rome, Italy</aff>
<aff id="Aff11">
<label></label>
Department of Computer, Control, and Management Engineering “Antonio Ruberti”, Viale Ariosto 25, 00185 Rome, Italy</aff>
<aff id="Aff12">
<label></label>
IRCCS Centro Neurolesi “Bonino-Pulejo”, S.S.113 Via Palermo C/da Casazza, 98123 Messina, Italy</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>3</day>
<month>12</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>3</day>
<month>12</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="collection">
<year>2014</year>
</pub-date>
<volume>7</volume>
<elocation-id>869</elocation-id>
<history>
<date date-type="received">
<day>5</day>
<month>5</month>
<year>2014</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>11</month>
<year>2014</year>
</date>
</history>
<permissions>
<copyright-statement>© Weitschek et al.; licensee BioMed Central Ltd. 2014</copyright-statement>
<license license-type="open-access">
<license-p>This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p>Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (
<italic>reads</italic>
). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling, mutation analysis.</p>
</sec>
<sec>
<title>Methods</title>
<p>We propose a method to evaluate the similarity between reads. This method does not rely on the alignment of the reads and it is based on the distance between the frequencies of their substrings of fixed dimensions (
<italic>k</italic>
-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and Blast. The comparison is based on a simple assumption: the most correct distance is obtained by knowing in advance the reference sequence. Therefore, we first align the reads on the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how the alignment-free and the alignment-based distances reproduce this ideal distance. The ability of correctly reproducing the ideal distance is evaluated over samples of read pairs from
<italic>Saccharomyces cerevisiae</italic>
,
<italic>Escherichia coli</italic>
, and
<italic>Homo sapiens</italic>
. The comparison is based on the correctness of threshold predictors cross-validated over different samples.</p>
</sec>
<sec>
<title>Results</title>
<p>We exhibit experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time consuming distances based on alignment.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>Alignment-free distances may be used effectively for reads comparison, and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, reads classification).</p>
</sec>
<sec>
<title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/1756-0500-7-869) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Sequence analysis</kwd>
<kwd>Next generation sequencing</kwd>
<kwd>Alignment-free</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2014</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p>The development of Next Generation Sequencing (NGS) machines allows the extraction of an extremely large amount of reads (i.e., short fragments of an organism’s genome) at low cost. The length of such reads is very small when compared to the length of a genome: it may range from 40 to 300 base pairs (bp) (i.e., characters), while the length of a simple genome (e.g., bacteria) is in the order of millions base pairs (Mbp). Three main NGS technologies are currently used [
<xref ref-type="bibr" rid="CR1">1</xref>
]: Roche 454, Illumina, and Ion Torrent. At present, Illumina technology performances are 40 gigabase pairs (Gbp) per day at a low cost per bp [illumina.com] with reads average length of 70 bp; Roche 454 performances are 1 Gbp per day at a higher cost with reads average length of 250 bp [454.com]; Ion Torrent machines produce reads of 200 bp with a throughput of 5 Gbp per day at a low cost per bp [
<xref ref-type="bibr" rid="CR2">2</xref>
].</p>
<p>DNA assembly can be defined as the reconstruction of a genome, starting from a large number of short overlapped fragments (
<italic>reads</italic>
) obtained by a
<italic>sequencing</italic>
operation. The length of each read and the number of reads are determined by the type of sequencer. The complexity of the assembly process stems from the length and number of the reads: longer reads are easier to be assembled, while a larger number of short reads requires a higher computational effort, although providing more information. Typically, the number of reads produced by NGS experiments reaches several millions or more, depending on the sequencing coverage and on its depth. The use of NGS machines results in much larger sets of reads to be assembled, posing new problems for computer scientists and bioinformaticians, whose task is to design algorithms that align and merge the reads for an effective reconstruction of the genome (or large portions of it) with sufficient precision and speed [
<xref ref-type="bibr" rid="CR3">3</xref>
].</p>
<p>Many competing algorithms have been developed for DNA assembly: a comprehensive comparison of recent and well-established ones can be found in [
<xref ref-type="bibr" rid="CR4">4</xref>
] and [
<xref ref-type="bibr" rid="CR5">5</xref>
], where these methods are tested on common benchmarks. The assembly problem is proven to be NP-hard [
<xref ref-type="bibr" rid="CR6">6</xref>
] and several heuristic algorithms have been proposed. Algorithms for DNA assembly are based on two main approaches: overlap graphs (e.g., [
<xref ref-type="bibr" rid="CR7">7</xref>
]) and De Bruijn Graphs [
<xref ref-type="bibr" rid="CR4">4</xref>
]. In an overlap graph each read corresponds to a node, and the overlaps between read pairs - that define the weights of the arcs - are usually computed by means of alignment methods; an assembly is derived from an Hamiltonian path in this graph. In the De Bruijn Graphs approach, reads are represented on a graph whose nodes and edges are nucleotide subsequences of length
<italic>k</italic>
(called
<italic>k</italic>
-mers) [
<xref ref-type="bibr" rid="CR8">8</xref>
]; an edge corresponds to an overlap between two nodes. The assembly is found searching for an Eulerian cycle in this graph and it is represented by a sequence of edges. Several assembly software tools (e.g., ABySS [
<xref ref-type="bibr" rid="CR9">9</xref>
], Velvet [
<xref ref-type="bibr" rid="CR10">10</xref>
], and SoapDeNovo [
<xref ref-type="bibr" rid="CR11">11</xref>
]) use subsequences of fixed dimensions (
<italic>k</italic>
-mers) for building the De Bruijn graph. These and other well-established assembly algorithms (e.g., Ssake, Sharcgs, Vcake, Newbler, Celera Assembler, Euler, and AllPaths) are described and compared in [
<xref ref-type="bibr" rid="CR12">12</xref>
]. We note that the role of
<italic>k</italic>
-mers in the assembly approaches based on De Bruijn graph is substantially different from the role they play in the the definition of the alignment-free distance described later. In fact, the De Bruijn graph uses
<italic>k</italic>
-mers as nodes of the graph and does not consider their frequency, while our approach is based on the frequency of
<italic>k</italic>
-mers to assess reads similarity.</p>
<p>A large number of these algorithms - in particular, those using the overlap graphs - are based on the similarity between reads. Such a similarity is the main way to assess whether two reads may be overlapped in the reconstruction process or not. In these approaches, such a measure is hence required to compare each read pair, generating a number of comparisons that is potentially quadratic in the number of reads. Therefore, it is extremely important to develop methods that can quickly establish whether two reads are similar or not.</p>
<p>In this paper, we focus on alignment-free techniques that have been proven to be effective in sequence analysis [
<xref ref-type="bibr" rid="CR13">13</xref>
,
<xref ref-type="bibr" rid="CR14">14</xref>
]. These techniques can be classified into two main groups: methods based on sequence compression and methods that rely on subsequence (oligomers) frequencies [
<xref ref-type="bibr" rid="CR13">13</xref>
]. The aim of the methods belonging to the first group is to find the shortest possible description of the sequence. They compute the similarity of the sequences by analyzing their compressed representations. Currently available methods are based on the Kolmogorov complexity [
<xref ref-type="bibr" rid="CR15">15</xref>
] and on Universal Sequence Maps [
<xref ref-type="bibr" rid="CR16">16</xref>
]. An extensive review can be found in [
<xref ref-type="bibr" rid="CR17">17</xref>
]. The methods based on oligomers frequencies rely on the computation of the substring frequencies of a given length
<italic>k</italic>
in the original sequences, called
<italic>k</italic>
-mers. Here, the similarity of two sequences is based only on the dictionary of subsequences that appear in the strings, irrespective of their relative position [
<xref ref-type="bibr" rid="CR17">17</xref>
].</p>
<p>The alignment-free distance adopted in this study is inspired to the
<italic>k</italic>
-mer frequency analysis [
<xref ref-type="bibr" rid="CR18">18</xref>
], where the frequencies of the
<italic>k</italic>
-mers are represented in a real vector, and hence they are easily tractable in a mathematical space: the distance between two reads is obtained by the distance between their frequency vector representations. A simple and easy way to compute a distance measure is the Euclidean distance, although others may be used (e.g., the
<italic>d</italic>
2 distance of [
<xref ref-type="bibr" rid="CR19">19</xref>
]). The goal of this paper is to evaluate the reliability of an alignment-free distance for read pairs similarity and to compare it with respect to other read-to-read distances that are based on global or local alignment of the two reads.</p>
<p>The paper is organized as follows.</p>
<p>In section
<italic>Methods</italic>
, we provide sufficient background for the main methods and techniques used in the paper: the different adopted read pairs distances are described (subsections
<italic>Bowtie distance</italic>
,
<italic>Needleman-Wunsch edit distance</italic>
,
<italic>The Blast alignment distance</italic>
and
<italic>Alignment-free distance on tetramer frequencies</italic>
). Following, we outline the rationale of threshold predictors and the way they are computed from data. Section
<italic>Results and discussion</italic>
describes the experimental design and its results. First, we delineate the data sets extraction and the experimental procedure (subsection
<italic>Data sets and experimental settings</italic>
). Then, we consider the computational performances of the different distances (subsection
<italic>Computational time analysis of the threshold predictors</italic>
), how they correlate among each other (subsection
<italic>Pearson correlation among distances</italic>
), their prediction performances over the training sets with the support of ROC curves and AUC indicators (subsection
<italic>Performance analysis of the threshold predictors</italic>
), and their predictive results for a cross validation evaluation scheme (subsection
<italic>Cross validation performances of the</italic>
AF
<italic>threshold predictor</italic>
). Finally, we provide discussion of the results (subsection
<italic>Final discussion</italic>
). In section
<italic>Conclusions</italic>
we delineate the conclusions and the perspectives of the work.</p>
</sec>
<sec id="Sec2" sec-type="methods">
<title>Methods</title>
<p>We consider a straightforward implementation of the alignment-free distance, based on the euclidean distance of the frequency distribution of
<italic>k</italic>
-mers (i.e., substrings composed of
<italic>k</italic>
consecutive bases) in the two reads. Such a distance, referred to as AF in the following, is very simple to compute and requires linear time in the dimension of the reads. As far as the choice of the length of the oligomers, we adopt
<italic>k</italic>
=4 (tetramers) as in many references this value has been confirmed to provide an ideal balance between the length of the oligomers and their number, when the sequences are expressed in the (A, C, G, T) alphabet [
<xref ref-type="bibr" rid="CR20">20</xref>
<xref ref-type="bibr" rid="CR22">22</xref>
]. AF is compared with respect to two methods to measure DNA string similarity that are based on sequence alignment: the Needleman-Wunsch edit distance (NW) and the Blast alignment algorithm (BL).</p>
<p>Both methods require quadratic time in the length of the reads. Their choice is motivated by the fact that the first is a global alignment method, i.e., it searches for the best alignment of the complete reads, while the second is a local alignment, i.e., it searches for the longest possible portion that is aligned well within the two reads. Therefore, their choice covers the two main approaches used in computing alignment-based distances.</p>
<p>To perform a proper comparison among AF, NW, and BL we adopt the following test. First, we assume the existence of an
<italic>ideal distance</italic>
, i.e., the distance that is given by the degree of overlapping of reads that have been aligned on their known reference genome. Second, we verify the ability of the three distances in approximating this ideal (target) distance. Given a pair of reads, such an approximation is measured by the ability of predicting the value of the target distance using the value of the predicting one. This assumption is based on the fact that an assembly method that uses the target distance to evaluate the opportunity of overlapping two reads would result in a extremely satisfactory assembly.</p>
<p>To align the reads over the original sequence, we use the well-established Bowtie algorithm [
<xref ref-type="bibr" rid="CR23">23</xref>
,
<xref ref-type="bibr" rid="CR24">24</xref>
]; two reads receive a maximum distance value if they do not overlap over the reference sequence; otherwise, they receive a distance inversely proportional to their degree of overlapping over the sequence (e.g., they would have minimum distance if they are aligned in the same position by the Bowtie algorithm). Given two reads, we define such a value their
<italic>Bowtie distance</italic>
(BT in the following). We refer to BT as the
<italic>target</italic>
distance and either to AF, or NW or BL as the
<italic>predictor</italic>
distance. A
<italic>threshold predictor</italic>
is a mapping between the values of the target distance and the values of the predictor distance; in other words, it assigns to each value of the target distance, say
<italic>α</italic>
, a value of the predictor distance, say
<italic>β</italic>
. Informally, we may define the threshold predictor as a mapping
<italic>m</italic>
such that
<italic>m</italic>
(
<italic>α</italic>
) = 
<italic>β</italic>
. Then, the target distance between two given reads is predicted to be below
<italic>α</italic>
when the predictor distance between the same two reads is below
<italic>β</italic>
.</p>
<p>According to the above definition, for each value of the target distance the threshold predictor may incur in errors in terms of false positive and false negative predictions. The quality of the threshold predictor is given by the error distribution over the predicted values of the target distance.</p>
<p>To test our method, we consider DNA sequences of three different organisms:
<italic>Saccharomyces cerevisiae</italic>
,
<italic>Escherichia coli</italic>
, and
<italic>Homo sapiens</italic>
. Publicly available sets of NGS reads for the three reference sequences are used. Each experiment is based on a large sample of read pairs, from which the best possible threshold predictor (among AF, NW, or BL) of the BT distance value is computed. The precision of the predictors is evaluated building ROC curves both on the samples of read pairs used to identify the best predictors (training data), and on other samples from the same set of read pairs not used for training (testing data). The results show how AF performs very well as a threshold predictor for BT; its performances are indeed better than those of NW and comparable to those exhibited by BL. Furthermore, both NW and BL are much more demanding in terms of computing time when compared with respect to AF.</p>
<sec id="Sec3">
<title>Bowtie distance</title>
<p>The Bowtie distance (BT) is obtained after computing the alignments of the reads to the reference genome with the Bowtie algorithm [
<xref ref-type="bibr" rid="CR23">23</xref>
,
<xref ref-type="bibr" rid="CR24">24</xref>
]. Bowtie is able to align reads to the reference genome at a very high speed (25 million reads of 35 bp length per hour).</p>
<p>Prior to the computation of the alignments, Bowtie builds an index of the reference genome with a Burrows-Wheeler approach. Two versions of Bowtie are available: Bowtie 1 [
<xref ref-type="bibr" rid="CR23">23</xref>
] and Bowtie 2 [
<xref ref-type="bibr" rid="CR24">24</xref>
]. The first one is optimized for short genomes, the latter for longer ones and supports gapped, local, and paired-end alignment modes. For each read, the alignment position in the reference genome is obtained after running the Bowtie algorithm. Given two reads
<italic>r</italic>
<sub>1</sub>
,
<italic>r</italic>
<sub>2</sub>
we define the Bowtie distance as follows:
<disp-formula id="Equa">
<graphic xlink:href="13104_2014_3381_Equa_HTML.gif" position="anchor"></graphic>
</disp-formula>
</p>
<p>where ∇(
<italic>r</italic>
<sub>1</sub>
,
<italic>r</italic>
<sub>2</sub>
) is the number of overlapped positions of
<italic>r</italic>
<sub>1</sub>
and
<italic>r</italic>
<sub>2</sub>
,
<italic>λ</italic>
<sub>1</sub>
is the length of
<italic>r</italic>
<sub>1</sub>
, and
<italic>λ</italic>
<sub>2</sub>
is the length of
<italic>r</italic>
<sub>2</sub>
. If multiple alignments of the same reads are present, their average is used.</p>
</sec>
<sec id="Sec4">
<title>Needleman-Wunsch edit distance</title>
<p>The Needleman and Wunsch algorithm (NW) [
<xref ref-type="bibr" rid="CR25">25</xref>
], based on dynamic programming, is commonly used to perform a global alignment of two sequences. The algorithm time complexity is quadratic with respect to the lengths of the two sequences (
<italic>N</italic>
and
<italic>M</italic>
) to be aligned (
<italic>O</italic>
(
<italic>n</italic>
 ∗ 
<italic>m</italic>
), where
<italic>n</italic>
and
<italic>m</italic>
are the number of bases in the two reads). The NeoBio [
<xref ref-type="bibr" rid="CR26">26</xref>
] Java implementation of the NW algorithm is adopted for performing the distance evaluation experiments. We adopt following parameter settings for the NW algorithm:</p>
<p>
<list list-type="bullet">
<list-item>
<p> +1 for the reward of a match (i.e., a substitution of equal characters);</p>
</list-item>
<list-item>
<p> -1 for the penalty of a mismatch (i.e., a substitution of different characters);</p>
</list-item>
<list-item>
<p> -1 for the cost of a gap (i.e., an insertion or deletion of a character).</p>
</list-item>
</list>
</p>
<p>We use the above-mentioned configuration in order to assign an equally balanced score for a match (+1), a mismatch (-1), and a gap (-1). For further details we point the reader to the NeoBio documentation [
<xref ref-type="bibr" rid="CR26">26</xref>
]. The NW distance is obtained from the Needleman-Wunsch score in two steps. First, the score is subtracted to its maximum possible value (perfect alignment) in order to obtain null distance in case of equal sequences and large distance for different ones; then, it is normalized between 0 and 1 to ease the comparisons with the other measures.</p>
</sec>
<sec id="Sec5">
<title>The Blast alignment distance</title>
<p>The Basic Local Alignment Search Tool (Blast) [
<xref ref-type="bibr" rid="CR27">27</xref>
] is used to compare a query sequence with respect to a library or database of sequences. Blast adopts an heuristic approach that is less accurate than other methods, but much faster. The Blast time complexity is also quadratic (
<italic>O</italic>
(
<italic>n</italic>
 ∗ 
<italic>m</italic>
) where
<italic>n</italic>
and
<italic>m</italic>
are the lengths of the two reads to be aligned). It is worth noting that this is the same time complexity as other algorithms, including the NW global alignment. However, given the heuristic nature of the algorithm, the statistically significant elimination of High-scoring Segment Pairs (HSPs) and words is used. In this way, Blast significantly reduces the amount of computation, running much faster than its worst case time complexity. In this work, we use Blast2 that is the Blast version to simply align two sequences. The Blast implementation available in [
<xref ref-type="bibr" rid="CR28">28</xref>
] was adopted for computing the Blast scores and the Blast expected values between the considered read pairs. The parameters adopted for the runs are described in Table
<xref rid="Tab1" ref-type="table">1</xref>
: we turn off the masking parameter, which filters out low complexity and high frequency regions (e.g., repetitive parts) of the genomic sequence. The final Blast distance (BL) is obtained by subtracting the Blast score to its maximum value and normalizing it between 0 and 1 (given the fixed and equal size of the sequences, the Blast expected values resulted to be perfectly log-correlated with the Blast scores).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>
<bold>Blast parameters setting</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Parameter</th>
<th align="left">Value</th>
<th align="left">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">-p</td>
<td align="left">blastn</td>
<td align="left">Blast program for nucleotide sequences</td>
</tr>
<tr>
<td align="left">-F</td>
<td align="left">F</td>
<td align="left">Masking and filtering off</td>
</tr>
<tr>
<td align="left">-w</td>
<td align="left">4</td>
<td align="left">Windows size</td>
</tr>
<tr>
<td align="left">-i</td>
<td align="left">r1.fas</td>
<td align="left">First input filename</td>
</tr>
<tr>
<td align="left">-j</td>
<td align="left">r2.fas</td>
<td align="left">Second input filename</td>
</tr>
<tr>
<td align="left">-m</td>
<td align="left">8</td>
<td align="left">Alignment view set to tabular output</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec6">
<title>Alignment-free distance on tetramer frequencies</title>
<p>We provide a simple sketch of the alignment-free distance computation used in this paper, mainly based on [
<xref ref-type="bibr" rid="CR13">13</xref>
]. The frequencies of each substring of length 4 (also called
<italic>tetramers</italic>
) are computed by counting the occurrences of the substrings in the read with a sliding window of length 4, starting at position 1 and ending at position
<italic>n</italic>
 - 4 + 1, where
<italic>n</italic>
is the length of the read. For the alphabet composed of the four symbols (A,C,G,T) we have a total of 4
<sup>4</sup>
 = 256 different tetramers and thus each read is represented by a vector of 256 real numbers between 0 and 1. The choice of tetramers is motivated by [
<xref ref-type="bibr" rid="CR20">20</xref>
<xref ref-type="bibr" rid="CR22">22</xref>
], which confirm the ideal balance between the length of the oligomers and their number. Given two reads, the Euclidean distance between their associated frequency vectors is an inverse measure of the similarity of the two reads, and we refer to it as the AF distance between the two reads. An efficient Java implementation of the alignment-free frequency vector computation and representation was developed for computing the AF distance between the available read pairs. The related algorithm is linear with respect to the length of the reads and is available at dmb.iasi.cnr.it/ngs_distances.php.</p>
</sec>
<sec id="Sec7">
<title>Threshold distance predictors</title>
<p>Our goal is to show how the AF distance between two DNA reads can approximate the BT distance, taking into account a tolerable degree of accuracy. In greater details, we aim to show that AF approximates BT as well as NW or BL distances, although less computationally demanding.</p>
<p>For a formal definition of
<italic>threshold predictor</italic>
, we need some additional notation. Let
<italic>r</italic>
<sub>1</sub>
and
<italic>r</italic>
<sub>2</sub>
be two generic reads coming from a DNA sequencing operation, and
<italic>d</italic>
<sub>1</sub>
(
<italic>r</italic>
<sub>1</sub>
,
<italic>r</italic>
<sub>2</sub>
),
<italic>d</italic>
<sub>2</sub>
(
<italic>r</italic>
<sub>1</sub>
,
<italic>r</italic>
<sub>2</sub>
) be two alternative read-to-read distance functions. Given a vector
<italic>α</italic>
of dimension
<italic>m</italic>
,
<italic>α</italic>
 = (
<italic>α</italic>
<sub>1</sub>
,
<italic>α</italic>
<sub>2</sub>
,…,
<italic>α</italic>
<sub>
<italic>m</italic>
</sub>
), a
<italic>threshold predictor</italic>
of
<italic>d</italic>
<sub>1</sub>
(·,·) by
<italic>d</italic>
<sub>2</sub>
(·,·) is determined by a vector
<italic>β</italic>
 = (
<italic>β</italic>
<sub>1</sub>
,
<italic>β</italic>
<sub>2</sub>
,…,
<italic>β</italic>
<sub>
<italic>m</italic>
</sub>
). Given two reads
<italic>r</italic>
<sub>1</sub>
,
<italic>r</italic>
<sub>2</sub>
, the prediction on
<italic>d</italic>
<sub>1</sub>
(·,·) is obtained by the following rule:
<disp-formula id="Equb">
<graphic xlink:href="13104_2014_3381_Equb_HTML.gif" position="anchor"></graphic>
</disp-formula>
</p>
<p>The threshold predictor depends on the choice of the value of
<italic>d</italic>
<sub>1</sub>
(·,·) in the vector
<italic>α</italic>
. We are indeed interested in the prediction only if
<italic>d</italic>
<sub>1</sub>
(
<italic>r</italic>
<sub>1</sub>
,
<italic>r</italic>
<sub>2</sub>
) is below a certain value based on the value of
<italic>d</italic>
<sub>2</sub>
(
<italic>r</italic>
<sub>1</sub>
,
<italic>r</italic>
<sub>2</sub>
), and would like this prediction to be precise for small reference values (i.e., those contained in the vector
<italic>α</italic>
).</p>
<p>As anticipated, we have BT as
<italic>target</italic>
distance (i.e.,
<italic>d</italic>
<sub>1</sub>
(·,·)) and the other distances as
<italic>predictors</italic>
(i.e.,
<italic>d</italic>
<sub>2</sub>
(·,·)). Since the original sequence is unknown, it is not always possible to compute the BT distance. Hence, we focused on predicting the BT distance by means of NW, BL or AF. In this context, a threshold predictor which is precise for small values in
<italic>α</italic>
turns out to be very useful. Thereby, we are not interested in predicting whether two reads are far from each other: we only want to know if they are close to each other or not.</p>
<p>A proper way to evaluate the quality of a threshold predictor is to measure its errors over one or more samples of read pairs where all distances are known. For each value
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
we measure the
<italic>true positive</italic>
(i.e., read pairs that are below
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
according to the target distance and are predicted to be below
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
by the threshold predictor) and
<italic>true negative</italic>
(i.e., read pairs that are above
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
according to the target distance and are predicted to be above
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
by the threshold predictor) rates associated with the above rule, and from this derive standard performance indicators such as ROC curves and AUC values [
<xref ref-type="bibr" rid="CR29">29</xref>
].</p>
<p>Read pair samples are used also to identify and test good threshold predictors. Given the value
<italic>α</italic>
 = (
<italic>α</italic>
<sub>1</sub>
,
<italic>α</italic>
<sub>2</sub>
,…,
<italic>α</italic>
<sub>
<italic>m</italic>
</sub>
), we compute for a sufficiently large set of candidate values
<italic>β</italic>
 = (
<italic>β</italic>
<sub>1</sub>
,
<italic>β</italic>
<sub>2</sub>
,…) the true positive and the true negative rates, construct the associated ROC curve for each value
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
, and derive the corresponding AUC value. If the AUC value is good enough, we identify the value
<italic>β</italic>
<sub>
<italic>j</italic>
</sub>
that provides the largest combination of true positive and true negative rates, and adopt that for
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
. Then, a complete threshold predictor is obtained by repeating this operation for each
<italic>α</italic>
<sub>
<italic>i</italic>
</sub>
,
<italic>i</italic>
 = 1,
<italic>m</italic>
.</p>
<p>The measure of a distinct precision value for each level of the target function enables to evaluate the reliability of the predictors there where it is needed. Clearly, the validity of a
<italic>threshold predictor</italic>
depends on the quality and the representativeness of the samples used to train (e.g., to derive the predicting vector
<italic>β</italic>
) and test the method. For the latter we adopt a standard cross-validation approach: first, the read pairs are sampled in disjoint sets; then some of these sets are used for training (e.g., derive the best values of
<italic>β</italic>
for the given values of
<italic>α</italic>
) and others are used for testing (measure the error of the so obtained threshold predictor). In several iterations, the role of training and testing samples is exchanged in order to mitigate the potential bias associated with the sampling.</p>
<p>The set of reference values
<italic>α</italic>
(for the target distance) and
<italic>β</italic>
(for the predictors) that have been used for the experiments are the values that separate the
<italic>percentiles</italic>
of the read pair distance distribution. In this way, we have that the first component of the
<italic>α</italic>
vector is larger than 1
<italic>%</italic>
of the sampled read pairs, the second is larger than 2
<italic>%</italic>
, and so on (similarly for the
<italic>β</italic>
vector). This allows to sample the whole variation range of the normalized distances, obtaining a finer granularity in the portions where the density is higher. According to this choice, both
<italic>α</italic>
and
<italic>β</italic>
are vectors composed of 100 real values between 0 and 1 in non-decreasing order. Clearly, this choice may be changed with equally spaced intervals without a significant effect on the results, once the proper granularity of the intervals has been identified.</p>
</sec>
<sec id="Sec8">
<title>Data sets and experimental settings</title>
<p>In this subsection, we describe the adopted NGS data sets and the experimental setting. Three different organisms are taken into account,
<italic>Saccharomyces cerevisiae</italic>
(commonly known as yeast),
<italic>Escherichia coli</italic>
, and
<italic>Homo sapiens</italic>
(commonly known as human).</p>
<p>We design and apply the following experimental procedure:</p>
<p>
<list list-type="bullet">
<list-item>
<p> 
<italic>N</italic>
reads are downloaded from the NCBI Sequence Read Archive (SRA) database [
<xref ref-type="bibr" rid="CR30">30</xref>
], or Chang Gung University, Department of Parasitology, College of Medicine (CGU) [
<xref ref-type="bibr" rid="CR31">31</xref>
], and different NGS platforms;</p>
</list-item>
<list-item>
<p> By using the Bowtie algorithm [
<xref ref-type="bibr" rid="CR23">23</xref>
,
<xref ref-type="bibr" rid="CR24">24</xref>
] the reads are aligned to the corresponding reference genome ( [
<xref ref-type="bibr" rid="CR32">32</xref>
,
<xref ref-type="bibr" rid="CR33">33</xref>
]);</p>
</list-item>
<list-item>
<p> From the resulting alignments a random selection of
<italic>rs</italic>
reads is computed and a reverse complemented representation is calculated, obtaining a total of
<italic>rtot</italic>
reads;</p>
</list-item>
<list-item>
<p> Out of all the possible pairs of different reads from this set, six subsets are selected, each with
<italic>rp</italic>
read pairs. In order to have half of the set with non overlapping reads (e.g., maximum Bowtie distance) and the other half with a variable degree of overlap, the random selection of these six subsets is controlled;</p>
</list-item>
<list-item>
<p> The four distances are then computed for each pair in the set: the Bowtie Distance BT, the Needleman and Wunsch edit distance NW, the Blast score BL, and the alignment-free distance AF over the tetramers (i.e., substrings of length 4);</p>
</list-item>
<list-item>
<p> These measures are all turned into proper distances (see section Methods) ranging from 0 to 1 with 0 corresponding to equal sreads and 1 corresponding to maximally different reads.</p>
</list-item>
</list>
</p>
<p>In the following the six datasets are referred as YA, YB, …, YF (Y as in yeast), EA, EB, …, EF (E as in E. coli), HA, HB, …, HF (H as in human).</p>
<p>A compact summarized overview and description of the data sets is given in Table
<xref rid="Tab2" ref-type="table">2</xref>
.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>
<bold>Compact overview of the datasets</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th colspan="3">Datasets</th>
</tr>
<tr>
<th align="left"></th>
<th>Yeast</th>
<th>E. coli</th>
<th>Human</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>Genome length</bold>
</td>
<td align="center">12.1 Mb</td>
<td align="center">4.6 Mb</td>
<td align="center">3.2 Gb</td>
</tr>
<tr>
<td align="left">
<bold>Sequencing machine</bold>
</td>
<td align="center">Illumina HiSeq</td>
<td align="center">Roche 454</td>
<td align="center">Illumina GA II</td>
</tr>
<tr>
<td align="left">
<bold>Database</bold>
</td>
<td align="center">NCBI SRA</td>
<td align="center">CGU</td>
<td align="center">NCBI SRA</td>
</tr>
<tr>
<td align="left">
<bold>Accession number</bold>
</td>
<td align="center">ERX191563</td>
<td align="center">-</td>
<td align="center">SRX013970</td>
</tr>
<tr>
<td align="left">
<bold>Run id</bold>
</td>
<td align="center">ERR216898</td>
<td align="center">-</td>
<td align="center">SRR031057</td>
</tr>
<tr>
<td align="left">
<bold>Number of downloaded reads (</bold>
<bold>
<italic>N</italic>
</bold>
<bold>)</bold>
</td>
<td align="center">3,551,079</td>
<td align="center">436,142</td>
<td align="center">14,267,012</td>
</tr>
<tr>
<td align="left">
<bold>Avg. reads length</bold>
<bold>
<italic>±</italic>
</bold>
<bold>st.dev</bold>
</td>
<td align="center">100 ±6</td>
<td align="center">235 ±4</td>
<td align="center">75 ±5</td>
</tr>
<tr>
<td align="left">
<bold>Total base pairs</bold>
</td>
<td align="center">355.0 M</td>
<td align="center">102.5 M</td>
<td align="center">1.1 G</td>
</tr>
<tr>
<td align="left">
<bold>Random selection of aligned reads (</bold>
<bold>
<italic>rs</italic>
</bold>
<bold>)</bold>
</td>
<td align="center">54,860</td>
<td align="center">100,000</td>
<td align="center">183,672</td>
</tr>
<tr>
<td align="left">
<bold>Total number of selected reads (</bold>
<bold>
<italic>rtot</italic>
</bold>
<bold>)</bold>
</td>
<td align="center">109,720</td>
<td align="center">200,000</td>
<td align="center">367,344</td>
</tr>
<tr>
<td align="left">
<bold>Read pairs in each subset</bold>
 
<bold>
<italic>rp</italic>
</bold>
</td>
<td align="center">1 M</td>
<td align="center">200,000</td>
<td align="center">1 M</td>
</tr>
<tr>
<td align="left">
<bold>Source chromosome</bold>
</td>
<td align="center">chr1</td>
<td align="center">-</td>
<td align="center">chr1</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>It is worth noting that the choice from different platforms (Roche 454, Illumina GA II, and Illumina HiSeq) stems from three main reasons: first, we want to test our methods on different read lengths; second we aim to show that the performances of our distance are independent from the selected sequencing technology; lastly, these platforms are the most common ones.</p>
</sec>
</sec>
<sec id="Sec9">
<title>Results and discussion</title>
<p>The main goal of this work is to provide evidence that the
<italic>AF</italic>
distance is a suitable approach to approximate BT distance. We apply AF to the three different data sets described in subsection Data sets and experimental settings, showing the performances with respect to other two alignment based algorithms (NW, BL). As a first step, we compute the Pearson correlation coefficients among the distances in the read pairs samples; then, we compute the ability of each measure to predict BT distance at given BT thresholds, by ROC curves and the corresponding AUC values. We verify the consistency of the predictions by a cross validation scheme detailed in the following sections.</p>
<sec id="Sec10">
<title>Computational time analysis of the threshold predictors</title>
<p>All software implementations of BT, NW, BL, and AF are run under a 64 bit linux environment (kernel 2.6.26-2-amd64) with a 64 bit Oracle Java Virtual Machine (version 1.7.0_09) on an Intel Core i7 920 2.67 GHz processor with 24 GB RAM memory, 1 TB sata 7200rpm hard disk, and a Debian GNU Linux 5.0.10 operating system.</p>
<p>In Table
<xref rid="Tab3" ref-type="table">3</xref>
we show the computational time of the different threshold predictors on 1.00 E+4, 1.00 E+6, and 1.00 E+8 read pairs. The time was recorded by means of the
<italic>time</italic>
linux utility, which provides user (i.e., the actual cpu time spent for computation), system (i.e., time spent for system calls, e.g., input/output), and elapsed (i.e., real time between invocation and termination) times.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>
<bold>Computational time analysis</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Reads pairs</th>
<th colspan="3">1.00 E+4</th>
<th colspan="3">1.00 E+6</th>
<th colspan="3">1.00 E+8</th>
</tr>
<tr>
<th align="left">Time [sec]</th>
<th>User</th>
<th>System</th>
<th>Elapsed</th>
<th>User</th>
<th>System</th>
<th>Elapsed</th>
<th>User</th>
<th>System</th>
<th>Elapsed</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">AF </td>
<td align="center">2.09</td>
<td align="center">0.44</td>
<td align="center">1.95</td>
<td align="center">91.42</td>
<td align="center">36.89</td>
<td align="center">126.96</td>
<td align="center">10230.33</td>
<td align="center">3530.08</td>
<td align="center">13541.00</td>
</tr>
<tr>
<td align="left">NW </td>
<td align="center">11.98</td>
<td align="center">0.42</td>
<td align="center">11.74</td>
<td align="center">1466.49</td>
<td align="center">34.03</td>
<td align="center">1501.03</td>
<td align="center">362794.34</td>
<td align="center">3686.95</td>
<td align="center">365785.00</td>
</tr>
<tr>
<td align="left">Blast </td>
<td align="center">122.21</td>
<td align="center">81.14</td>
<td align="center">199.80</td>
<td align="center">10203.07</td>
<td align="center">7434.90</td>
<td align="center">18747.00</td>
<td align="center">1418254.45</td>
<td align="center">946049.60</td>
<td align="center">2365124.00</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>The results highlight the much lower running time requirements of the AF distance. From our experimental results we see that the running time of AF is approximately 10 times smaller than NW, that is in turn 10 times smaller than BL.</p>
<p>Regarding memory requirements and consumption, we note that if the algorithms compute the distance measures by loading all
<italic>n</italic>
 ∗ 
<italic>n</italic>
read pairs of average length
<italic>l</italic>
into memory, then they will require
<italic>l</italic>
 ∗ 
<italic>n</italic>
bytes (online implementation); else if they perform the computation separately for each read pair, then the memory consumption is 2
<italic>l</italic>
bytes (offline implementation).</p>
</sec>
<sec id="Sec11">
<title>Pearson correlation among distances</title>
<p>An initial comparison among the four distances is based on the analysis of the correlation coefficients of one distance with the others, over a sufficiently large sample of read pairs. The matrices in Tables
<xref rid="Tab4" ref-type="table">4</xref>
,
<xref rid="Tab5" ref-type="table">5</xref>
and
<xref rid="Tab6" ref-type="table">6</xref>
report the Pearson correlation values between the four read-to-read distances for the three organisms.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>
<bold>Pearson correlation matrix between the four read-to-read distances for Yeast - YA</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th>BT</th>
<th>NW</th>
<th>BL</th>
<th>AF</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>BT</bold>
</td>
<td align="center">1.00</td>
<td align="center">0.45</td>
<td align="center">0.81</td>
<td align="center">0.63</td>
</tr>
<tr>
<td align="left">
<bold>NW</bold>
</td>
<td align="center">0.45</td>
<td align="center">1.00</td>
<td align="center">0.48</td>
<td align="center">0.52</td>
</tr>
<tr>
<td align="left">
<bold>BL</bold>
</td>
<td align="center">0.81</td>
<td align="center">0.48</td>
<td align="center">1.00</td>
<td align="center">0.61</td>
</tr>
<tr>
<td align="left">
<bold>AF</bold>
</td>
<td align="center">0.63</td>
<td align="center">0.52</td>
<td align="center">0.61</td>
<td align="center">1.00</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>
<bold>Pearson correlation matrix between the four read-to-read distances for Human - HA</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th>BT</th>
<th>NW</th>
<th>BL</th>
<th>AF</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>BT</bold>
</td>
<td align="center">1.00</td>
<td align="center">0.68</td>
<td align="center">0.72</td>
<td align="center">0.67</td>
</tr>
<tr>
<td align="left">
<bold>NW</bold>
</td>
<td align="center">0.68</td>
<td align="center">1.00</td>
<td align="center">0.73</td>
<td align="center">0.72</td>
</tr>
<tr>
<td align="left">
<bold>BL</bold>
</td>
<td align="center">0.72</td>
<td align="center">0.73</td>
<td align="center">1.00</td>
<td align="center">0.63</td>
</tr>
<tr>
<td align="left">
<bold>AF</bold>
</td>
<td align="center">0.67</td>
<td align="center">0.72</td>
<td align="center">0.63</td>
<td align="center">1.00</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab6">
<label>Table 6</label>
<caption>
<p>
<bold>Pearson correlation matrix between the four read-to-read distances for E. coli -EA</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th>BT</th>
<th>NW</th>
<th>BL</th>
<th>AF</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<bold>BT</bold>
</td>
<td align="center">1.00</td>
<td align="center">0.76</td>
<td align="center">1.00</td>
<td align="center">0.95</td>
</tr>
<tr>
<td align="left">
<bold>NW</bold>
</td>
<td align="center">0.76</td>
<td align="center">1.00</td>
<td align="center">0.76</td>
<td align="center">0.82</td>
</tr>
<tr>
<td align="left">
<bold>BL</bold>
</td>
<td align="center">1.00</td>
<td align="center">0.76</td>
<td align="center">1.00</td>
<td align="center">0.95</td>
</tr>
<tr>
<td align="left">
<bold>AF</bold>
</td>
<td align="center">0.95</td>
<td align="center">0.82</td>
<td align="center">0.95</td>
<td align="center">1.00</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>For each organism, the correlations are computed in one of the six available samples. Similar results are obtained on the other samples (not shown). It is of interest to analyze the correlation of the predictor distances (NW, BL, AF) with the target distance BT. First, we note that the correlations measures are significantly different in the three organisms; in yeast the NW distance is extremely poorly correlated with BT, while such a correlation improves for E. coli and human; the correlation between AF and BT is also weaker in yeast than in the other two organisms. The BL correlation with BT appears to be the higher among the three predictors. Moreover, it is evident that BT distance is almost perfectly reproduced from the predictors BL and AF in E. coli, then followed by yeast and human.</p>
<p>We could not found our conclusions only on the correlations values. Indeed, the requirement of a linear dependence between target and prediction distances may be a biased condition for the existence of the BT threshold predictor. Correlation represents an average similarity over the whole scale of the target, whereas some (small) reference values of the target distance need to be predicted with higher accuracy. Hence, more appropriate evaluations reported on the following sections are used.</p>
</sec>
<sec id="Sec12">
<title>Performance analysis of the threshold predictors</title>
<p>In this section, we analyze the performances of the three predictors on the target distance. As above-mentioned, we consider 100 intervals of the target BT distance corresponding to the percentiles of its distribution, and identify by exhaustive inspection the percentiles of the predictor distance that minimize the prediction error. Such an analysis is performed by means of ROC curves, where we plot the true positive rate against the false positive rate for a given percentile of the target distance, with the percentiles of the predictor distance varying from 1 to 100. We recall that an ideal ROC curve contains the point (0,1) and therefore the area under the ROC curve (AUC) has value 1. Smaller values of AUC represent poorer prediction performances, and, in general, an AUC value is usually considered very good when in the proximity of 0.9.</p>
<p>We start presenting the ROC curves for the four values 0.10, 0.15, 0.20, and 0.25 of the target distance percentiles that correspond to determined values of BT. In Figures
<xref rid="Fig1" ref-type="fig">1</xref>
,
<xref rid="Fig2" ref-type="fig">2</xref>
and
<xref rid="Fig3" ref-type="fig">3</xref>
the ROC curves for predictors NW, BL, and AF are reported for the four reference values and for the three organisms. As one can observe, both AF (green, solid) and BL (blue, dotted) curves perform much better than NW (red, dashed). Figure
<xref rid="Fig1" ref-type="fig">1</xref>
depicts the ROC curves related to yeast and shows how both AF and BL perform much better than NW with AUC values higher than 0.9, while NW curves have AUC much smaller values (close to 0.7). Figure
<xref rid="Fig2" ref-type="fig">2</xref>
is related to E. coli and shows a very stable scenario: all the three measures are able to precisely predict BT for all the thresholds reaching values of AUC close to 1. Figure
<xref rid="Fig3" ref-type="fig">3</xref>
is related to human and shows that AF performs slightly better than NW that in turn performs slightly better than BL. AUC values of AF are close to 0.95, while those of NW are around 0.91, and those of BL range from 0.9 to 0.88. A more comprehensive outlook of the performances of the three predictors can be glanced from the three panels in Figure
<xref rid="Fig4" ref-type="fig">4</xref>
. Here we report the AUC values for all the 100 percentiles of the target distance, for three samples coming from yeast, E. coli, and human. Similar results are obtained when the other five samples from each organism are used (see Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
). The charts clearly show that for all three predictors the precision decreases for higher percentiles (i.e., larger values of the target distance). Lower percentiles (i.e., lower BT) correspond to higher level of overlapping, and we conclude that for these percentiles BT is easily predicted by the three measures.
<fig id="Fig1">
<label>Figure 1</label>
<caption>
<p>
<bold>ROC curves obtained for threshold predictors of yeast samples.</bold>
ROC curves obtained from threshold predictors of four different reference values of the target BT distance (0.10, 0.15, 0.20, 0.25). The charts report the ROC curves for the three predictor distances (NW, BL, AF). Results are provided on samples for yeast (YA).</p>
</caption>
<graphic xlink:href="13104_2014_3381_Fig1_HTML" id="d30e2473"></graphic>
</fig>
<fig id="Fig2">
<label>Figure 2</label>
<caption>
<p>
<bold>ROC curves obtained for threshold predictors of E. coli samples.</bold>
ROC curves obtained from threshold predictors of four different reference values of the target BT distance (0.10, 0.15, 0.20, 0.25). The charts report the ROC curves for the three predictor distances (NW, BL, AF). Results are provided on samples for E. coli (EA).</p>
</caption>
<graphic xlink:href="13104_2014_3381_Fig2_HTML" id="d30e2502"></graphic>
</fig>
<fig id="Fig3">
<label>Figure 3</label>
<caption>
<p>
<bold>ROC curves from threshold predictors of human samples.</bold>
ROC curves obtained from threshold predictors of four different reference values of the target BT distance (0.10, 0.15, 0.20, 0.25). The charts report the ROC curves for the three predictor distances (NW, BL, AF). Results are provided on samples for human (HA).</p>
</caption>
<graphic xlink:href="13104_2014_3381_Fig3_HTML" id="d30e2530"></graphic>
</fig>
<fig id="Fig4">
<label>Figure 4</label>
<caption>
<p>
<bold>AUC values for target percentile values.</bold>
AUC values for each percentile value of the target BT distance. The three panels report AUC values for the three predictor distances (NW, BL, AF). Results are provided on samples for yeast (panel
<bold>A</bold>
, YA), E. coli (panel
<bold>B</bold>
, EA), and human (panel
<bold>C</bold>
, HA).</p>
</caption>
<graphic xlink:href="13104_2014_3381_Fig4_HTML" id="d30e2573"></graphic>
</fig>
</p>
<p>The higher the percentiles (i.e., the smaller the overlapping), the higher the noise in the prediction will be. AUC values are indeed very high for smaller percentiles with the exception of NW in human. In Figure
<xref rid="Fig4" ref-type="fig">4</xref>
panel A (yeast), there is evidence that BL and AF have both good performances for all the percentiles in terms of AUC, showing values higher than 0.9, until percentile 30, and anyway values higher than 0.8 for the last percentiles. BL performs slightly better than AF until the percentile 20, then AF is better until the percentile 60 and again BL is better until the last percentiles. NW AUC values range from 0.72 until 0.68 showing again a light decreasing slope. Figure
<xref rid="Fig4" ref-type="fig">4</xref>
panel B (E. coli) shows - as previously highlighted - that the three measures have very good performances for this organism, with AUC values close to 1 until percentile 33. This means that all the three measures are able to correctly predict BT. From the percentile 33 AUC values of NW significantly decrease, while those of AF and BL are still close to 1 until the percentile 80, when they slowly decrease keeping anyway values higher than 0.9. In Figure
<xref rid="Fig4" ref-type="fig">4</xref>
panel C (human) it can be observed that the three curves start from an almost common AUC value, around 0.95, but diverge with increasing percentiles. AF has the best performance keeping its AUC values higher than 0.9, then NW decreasing until 0.85, and finally BL falling down to 0.55.</p>
</sec>
<sec id="Sec13">
<title>Cross validation performances of the AFthreshold predictor</title>
<p>The results discussed above show that the BL and AF predictors perform well when they are evaluated on the same sample that has been used to train the predictors reference values. It is more interesting to verify if the relation between the predictor and the target distance, derived from a read pairs sample, maintains its validity also on other samples that were not used to train the method.</p>
<p>We restrict this analysis to the AF predictor and test the threshold predictor rules derived from one sample on the other five samples of the same organism, in a cross validation scheme. The results are summarized in Tables
<xref rid="Tab7" ref-type="table">7</xref>
,
<xref rid="Tab8" ref-type="table">8</xref>
and
<xref rid="Tab9" ref-type="table">9</xref>
for yeast, E. coli, and human samples, respectively. The positive and negative precision rates are reported for the four reference values of the target distance already used in Figures
<xref rid="Fig1" ref-type="fig">1</xref>
,
<xref rid="Fig2" ref-type="fig">2</xref>
and
<xref rid="Fig3" ref-type="fig">3</xref>
(0.10, 0.15, 0.20, and 0.25), for all the combinations of the cross validation scheme. In Figure
<xref rid="Fig5" ref-type="fig">5</xref>
we plot the same positive, negative, and total precision rates for all the 100 reference values over a single sample (Panel A for yeast, Panel B for E. coli, and Panel C for human). Similar results have been obtained also for the other five samples of the three organisms.
<table-wrap id="Tab7">
<label>Table 7</label>
<caption>
<p>
<bold>True positive (TP) and true negative (TN) rates for reference values of target distance BT when predicted by AF, for yeast</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Set used for</th>
<th colspan="2">0.105</th>
<th colspan="2">0.15</th>
<th colspan="2">0.205</th>
<th colspan="2">0.25</th>
</tr>
<tr>
<th align="left">training the predictor</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">YA</td>
<td align="center">90.26</td>
<td align="center">85.05</td>
<td align="center">90.00</td>
<td align="center">85.80</td>
<td align="center">89.13</td>
<td align="center">86.76</td>
<td align="center">87.96</td>
<td align="center">87.57</td>
</tr>
<tr>
<td align="left">YB</td>
<td align="center">90.03</td>
<td align="center">85.23</td>
<td align="center">89.75</td>
<td align="center">86.01</td>
<td align="center">89.69</td>
<td align="center">86.20</td>
<td align="center">87.94</td>
<td align="center">87.61</td>
</tr>
<tr>
<td align="left">YC</td>
<td align="center">90.01</td>
<td align="center">85.28</td>
<td align="center">89.72</td>
<td align="center">86.06</td>
<td align="center">88.44</td>
<td align="center">87.34</td>
<td align="center">87.39</td>
<td align="center">88.09</td>
</tr>
<tr>
<td align="left">YD</td>
<td align="center">90.23</td>
<td align="center">85.03</td>
<td align="center">89.74</td>
<td align="center">86.03</td>
<td align="center">89.12</td>
<td align="center">86.76</td>
<td align="center">88.58</td>
<td align="center">86.94</td>
</tr>
<tr>
<td align="left">YE</td>
<td align="center">90.73</td>
<td align="center">84.52</td>
<td align="center">90.64</td>
<td align="center">85.11</td>
<td align="center">89.05</td>
<td align="center">86.77</td>
<td align="center">87.87</td>
<td align="center">87.57</td>
</tr>
<tr>
<td align="left">YF</td>
<td align="center">92.22</td>
<td align="center">70.50</td>
<td align="center">91.41</td>
<td align="center">70.48</td>
<td align="center">91.38</td>
<td align="center">71.96</td>
<td align="center">89.95</td>
<td align="center">73.30</td>
</tr>
<tr>
<td align="left">
<bold>Average</bold>
</td>
<td align="center">90.58</td>
<td align="center">82.60</td>
<td align="center">90.21</td>
<td align="center">83.25</td>
<td align="center">89.47</td>
<td align="center">84.30</td>
<td align="center">88.28</td>
<td align="center">85.18</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Predictor is trained on one set and tested over the other 5 sets.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab8">
<label>Table 8</label>
<caption>
<p>
<bold>True positive (TP) and true negative (TN) rates for reference values of target distance BT when predicted by AF, for E. coli</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Set used for</th>
<th colspan="2">0.105</th>
<th colspan="2">0.15</th>
<th colspan="2">0.205</th>
<th colspan="2">0.25</th>
</tr>
<tr>
<th align="left">training the predictor</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">EA</td>
<td align="center">97.00</td>
<td align="center">99.09</td>
<td align="center">96.38</td>
<td align="center">98.78</td>
<td align="center">95.43</td>
<td align="center">98.66</td>
<td align="center">94.61</td>
<td align="center">98.60</td>
</tr>
<tr>
<td align="left">EB</td>
<td align="center">99.08</td>
<td align="center">98.14</td>
<td align="center">98.90</td>
<td align="center">97.43</td>
<td align="center">98.54</td>
<td align="center">96.92</td>
<td align="center">97.97</td>
<td align="center">96.78</td>
</tr>
<tr>
<td align="left">EC</td>
<td align="center">100.00</td>
<td align="center">96.11</td>
<td align="center">99.98</td>
<td align="center">94.75</td>
<td align="center">99.98</td>
<td align="center">93.48</td>
<td align="center">99.96</td>
<td align="center">92.56</td>
</tr>
<tr>
<td align="left">ED</td>
<td align="center">98.67</td>
<td align="center">98.47</td>
<td align="center">98.30</td>
<td align="center">97.93</td>
<td align="center">97.95</td>
<td align="center">97.49</td>
<td align="center">97.49</td>
<td align="center">97.21</td>
</tr>
<tr>
<td align="left">EE</td>
<td align="center">98.31</td>
<td align="center">98.69</td>
<td align="center">97.42</td>
<td align="center">98.39</td>
<td align="center">97.15</td>
<td align="center">97.97</td>
<td align="center">96.57</td>
<td align="center">97.84</td>
</tr>
<tr>
<td align="left">EF</td>
<td align="center">95.46</td>
<td align="center">99.45</td>
<td align="center">94.15</td>
<td align="center">99.35</td>
<td align="center">93.46</td>
<td align="center">99.19</td>
<td align="center">91.43</td>
<td align="center">99.35</td>
</tr>
<tr>
<td align="left">
<bold>Average</bold>
</td>
<td align="center">98.09</td>
<td align="center">98.32</td>
<td align="center">97.52</td>
<td align="center">97.77</td>
<td align="center">97.09</td>
<td align="center">97.29</td>
<td align="center">96.34</td>
<td align="center">97.06</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Predictor is trained on one set and tested over the other 5 sets.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab9">
<label>Table 9</label>
<caption>
<p>
<bold>True positive (TP) and True negative (TN) rates for reference values of target distance BT when predicted by AF, for human</bold>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Set used for</th>
<th colspan="2">0.105</th>
<th colspan="2">0.15</th>
<th colspan="2">0.205</th>
<th colspan="2">0.25</th>
</tr>
<tr>
<th align="left">training the predictor</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
<th>TP [%]</th>
<th>TN [%]</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">HA</td>
<td align="center">91.05</td>
<td align="center">82.53</td>
<td align="center">91.24</td>
<td align="center">83.24</td>
<td align="center">90.96</td>
<td align="center">84.74</td>
<td align="center">90.19</td>
<td align="center">86.25</td>
</tr>
<tr>
<td align="left">HB</td>
<td align="center">94.34</td>
<td align="center">78.82</td>
<td align="center">94.34</td>
<td align="center">79.58</td>
<td align="center">93.67</td>
<td align="center">81.78</td>
<td align="center">94.41</td>
<td align="center">81.35</td>
</tr>
<tr>
<td align="left">HC</td>
<td align="center">89.27</td>
<td align="center">84.35</td>
<td align="center">90.15</td>
<td align="center">84.32</td>
<td align="center">88.92</td>
<td align="center">86.24</td>
<td align="center">89.77</td>
<td align="center">86.63</td>
</tr>
<tr>
<td align="left">HD</td>
<td align="center">89.32</td>
<td align="center">84.28</td>
<td align="center">90.17</td>
<td align="center">84.27</td>
<td align="center">89.36</td>
<td align="center">86.24</td>
<td align="center">88.99</td>
<td align="center">87.24</td>
</tr>
<tr>
<td align="left">HE</td>
<td align="center">89.12</td>
<td align="center">84.22</td>
<td align="center">89.57</td>
<td align="center">84.53</td>
<td align="center">89.02</td>
<td align="center">86.37</td>
<td align="center">87.99</td>
<td align="center">87.82</td>
</tr>
<tr>
<td align="left">HF</td>
<td align="center">84.11</td>
<td align="center">85.02</td>
<td align="center">85.47</td>
<td align="center">83.32</td>
<td align="center">84.19</td>
<td align="center">84.04</td>
<td align="center">82.04</td>
<td align="center">85.48</td>
</tr>
<tr>
<td align="left">
<bold>Average</bold>
</td>
<td align="center">89.53</td>
<td align="center">83.20</td>
<td align="center">90.16</td>
<td align="center">83.21</td>
<td align="center">89.36</td>
<td align="center">84.90</td>
<td align="center">88.90</td>
<td align="center">85.80</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Predictor is trained on one set and tested over the other 5 sets.</p>
</table-wrap-foot>
</table-wrap>
<fig id="Fig5">
<label>Figure 5</label>
<caption>
<p>
<bold>Cross validation of threshold predictors.</bold>
Cross Validation precision rates of threshold predictors for each percentile value of the target BT distance; the panels report precision rates on positive, negative, and total. Results are provided on samples for yeast (panel
<bold>A</bold>
, YA used for training), E. coli (panel
<bold>B</bold>
, EA used for training), and human (panel
<bold>C</bold>
, HA used for training).</p>
</caption>
<graphic xlink:href="13104_2014_3381_Fig5_HTML" id="d30e3507"></graphic>
</fig>
</p>
<p>Figure
<xref rid="Fig5" ref-type="fig">5</xref>
shows that in yeast (panel A) positive precision rate ranges from a percentage of 88.28 (BT = 0.25) until 90.58 (BT = 0.105) while the negative precision rate ranges from 82.60 (BT = 0.105) until 85.18 (BT = 0.25). In E. coli (panel B) both positive and negative precision rates show values always higher than 96.34. In human (panel C) positive precision rate always is around a percentage of 90 and negative precision rate ranges from 83.07 (BT = 0.105) and 85.98 (BT = 0.25).</p>
<p>Tables
<xref rid="Tab7" ref-type="table">7</xref>
,
<xref rid="Tab8" ref-type="table">8</xref>
and
<xref rid="Tab9" ref-type="table">9</xref>
show positive, negative, and total precision rates for all the percentiles in the three organisms, revealing, as expected, that for E. coli (Table
<xref rid="Tab8" ref-type="table">8</xref>
) AF has very good performances also in the cross validation (total precision rate is always higher than 0.9), with a slight decreasing slope for the higher percentiles. In human and yeast (Table
<xref rid="Tab9" ref-type="table">9</xref>
and Table
<xref rid="Tab7" ref-type="table">7</xref>
, respectively), we have globally good performances, with a total precision rate ranging from 0.8 to 0.9.</p>
<p>The results described in Tables
<xref rid="Tab7" ref-type="table">7</xref>
,
<xref rid="Tab8" ref-type="table">8</xref>
and
<xref rid="Tab9" ref-type="table">9</xref>
and Figure
<xref rid="Fig5" ref-type="fig">5</xref>
confirm indeed that the AF predictor performs very well also in the cross validation scheme and exhibits good generalization capabilities. The parallel analysis conducted on the other two candidate predictors shows similar performances of BL and much poorer performances of NW (results shown in Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
).</p>
</sec>
<sec id="Sec14">
<title>Final discussion</title>
<p>The results clearly show the efficacy of the alignment-free distance in estimating a good read-to-read distance measure. The performances of AF in predicting BT are better than NW and at least comparable to BL, but the advantage of using AF is clear: it is linear in the size of the input and has a lower computational time. Indeed, as already discussed in subsection Computational time analysis of the threshold predictors, AF is much faster than NW and BL. As reported above, the prediction power of the three measures depends on the organism we consider, and we believe that this issue deserves further analysis and discussion.</p>
<p>We analyze two eukariotic genomes (yeast and human) and one bacterial one (E. coli); there is evidence that the performances of all the three predictors are globally much better in E. coli. This may be due to the nature of bacterial genomes, which are mostly composed of coding sequences, making easier to recognize overlapping regions and reducing the noise due to low complexity regions present in the intergenic eukariotic portions of genome.</p>
<p>An additional fact that deserves attention is that the distance-based on global alignment (NW), generally performs poorly with respect to the one based on local alignment (BL); the alignment-free distance (AF) seems to compare well with the local alignment one, despite it is based on the evaluation of the whole sequence, overcoming the bias that may derive from requiring the global alignment of the two reads.</p>
<p>Such a consideration is somehow strengthened by the different performances obtained on reads of different sizes; we recall that reads from human are smaller (average size 75 bases) than yeast and E. coli (100 and 235 bases, respectively); such a difference may explain the improved performances of the NW distance in human, as with shorter reads the advantage of local versus global alignment is reduced.</p>
</sec>
</sec>
<sec id="Sec15" sec-type="conclusions">
<title>Conclusions</title>
<p>In this paper, we described and tested a method to compare NGS DNA reads based on an alignment-free distance. We compared our method with respect to Blast and Needleman-Wunsch algorithms, which rely on an alignment-based approach. We designed our experiments in order to measure the potential contribution of the method in filtering DNA reads and speed up an assembly process.</p>
<p>We showed that the alignment-free distance outperformed the two aligned-based ones both in terms of computational time and of prediction performance, and conclude that an alignment-free distance may be used effectively for read pairs comparison.</p>
<p>In future, we plan to extend the reads comparison with other competitive methods and also with other alignment-free distances. The results shown in this paper are considered as a starting point to derive more efficient sequence similarity assessments methods for DNA reads obtained from NGS sequencing.</p>
<p>Finally, read pairs comparison based on alignment-free distances may be conveniently used in future for DNA assembly [
<xref ref-type="bibr" rid="CR34">34</xref>
] given its considerable speed, as well as for reads classification [
<xref ref-type="bibr" rid="CR35">35</xref>
], e.g., in metagenomics.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Electronic supplementary material</title>
<sec id="Sec16">
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="13104_2014_3381_MOESM1_ESM.xlsx">
<caption>
<p>Additional file 1:
<bold>Microsoft Excel 2010 spreadsheet that contains additional charts and data that are not shown in the paper.</bold>
It is divided into 6 sheets: • Sheet SP1 AUC for yeast samples contains the AUC plots for the different percentiles of the target distance and the different predictors for the six different samples of read pairs from yeast (YA-YF). • Sheet SP2 AUC for E. coli samples contains the AUC plots for the different percentiles of the target distance and the different predictors for the six different samples of read pairs from E. coli (EA-EF). • Sheet SP3 AUC for human samples contains the AUC plots for the different percentiles of the target distance and the different predictors for the six different samples of read pairs from human (HA-HF). • Sheet SP4 NW-BL-AF test precision contains the precision of the three predictors for the different percentiles of the target distance, expressed in terms of true positive, true negative and total precision rates, for the HA read pairs sample of human. • Sheet SP5 test-human contains the detailed test results of the cross validation on the six samples of human. • Sheet SP6 test-ecoli contains the detailed test results of the cross validation on the six samples of E. coli. (XLSX 1012 KB)</p>
</caption>
</media>
</supplementary-material>
</sec>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>
<bold>Competing interests</bold>
</p>
<p>The authors declare that they have no competing interests.</p>
</fn>
<fn>
<p>
<bold>Authors’ contributions</bold>
</p>
<p>PB. directed research. GF. directed research, designed research, planned, and implemented part of the experiments. EW. designed and developed the AF distance. EW., GF., and GFi. conceived and performed the experiments. DS. provided mathematical validation, ROC analysis, and interpretation of results. EW., GF., DS., and GFi. wrote the manuscript. MCdC. performed statistical analysis and contributed to the design of the AF distance. All authors helped to draft and review the manuscript. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>We wish to thank Aleksandra Swiercz for introducing us in Next Generation Sequencing, Raffaele Giancarlo and Alberto Apostolico for sharing their knowledge about alignment-free algorithms with us.</p>
<p>The authors were partially supported by the Italian PRIN "GenData 2020" (2010RTFWBH), the FLAGSHIP "InterOmics" project (PB.P05), and by the cooperative program between the National Research Council of Italy (CNR) and the Polish Academy of Sciences (PAN).</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Eisenstein</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>The battle for sequencing supremacy</article-title>
<source>Nat Biotechnol</source>
<year>2012</year>
<volume>30</volume>
<issue>11</issue>
<fpage>1023</fpage>
<lpage>1026</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2412</pub-id>
<pub-id pub-id-type="pmid">23138289</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>N</given-names>
</name>
<name>
<surname>He</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Pong</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Law</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Comparison of next-generation sequencing systems</article-title>
<source>J Biomed Biotechnol</source>
<year>2012</year>
<fpage>251364</fpage>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Metzker</surname>
<given-names>ML</given-names>
</name>
</person-group>
<article-title>Sequencing technologies - the next generation</article-title>
<source>Nat Rev Genet</source>
<year>2010</year>
<volume>11</volume>
<issue>1</issue>
<fpage>31</fpage>
<lpage>46</lpage>
<pub-id pub-id-type="doi">10.1038/nrg2626</pub-id>
<pub-id pub-id-type="pmid">19997069</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Earl</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Bradnam</surname>
<given-names>K</given-names>
</name>
<name>
<surname>John</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Darling</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Fass</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>HOK</given-names>
</name>
<name>
<surname>Buffalo</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Diekhans</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Ariyaratne</surname>
<given-names>PN</given-names>
</name>
<name>
<surname>Sung</surname>
<given-names>W-K</given-names>
</name>
<name>
<surname>Ning</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Haimel</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Simpson</surname>
<given-names>JT</given-names>
</name>
<name>
<surname>Fonseca</surname>
<given-names>NA</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Docking</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Ho</surname>
<given-names>IY</given-names>
</name>
<name>
<surname>Rokhsar</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Lavenier</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Chapuis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Naquin</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Maillet</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Schatz</surname>
<given-names>MC</given-names>
</name>
<name>
<surname>Kelley</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Phillippy</surname>
<given-names>AM</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Nguyen</surname>
<given-names>N</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Assemblathon 1: a competitive assessment of de novo short read assembly methods</article-title>
<source>Genome Res</source>
<year>2011</year>
<volume>21</volume>
<issue>12</issue>
<fpage>2224</fpage>
<lpage>2241</lpage>
<pub-id pub-id-type="doi">10.1101/gr.126599.111</pub-id>
<pub-id pub-id-type="pmid">21926179</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bradnam</surname>
<given-names>KR</given-names>
</name>
<name>
<surname>Fass</surname>
<given-names>JN</given-names>
</name>
<name>
<surname>Alexandrov</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Baranay</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Bechner</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Boisvert</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Chapman</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Chapuis</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chikhi</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Chitsaz</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>W-C</given-names>
</name>
<name>
<surname>Corbeil</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fabbro</surname>
<given-names>CD</given-names>
</name>
<name>
<surname>Docking</surname>
<given-names>TR</given-names>
</name>
<name>
<surname>Durbin</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Earl</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Emrich</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Fedotov</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Fonseca</surname>
<given-names>NA</given-names>
</name>
<name>
<surname>Ganapathy</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gibbs</surname>
<given-names>RA</given-names>
</name>
<name>
<surname>Gnerre</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Godzaridis</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Goldstein</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Haimel</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Haussler</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hiatt</surname>
<given-names>JB</given-names>
</name>
<name>
<surname>Ho</surname>
<given-names>IY</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species</article-title>
<source>GigaScience</source>
<year>2013</year>
<volume>2</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>31</lpage>
<pub-id pub-id-type="doi">10.1186/2047-217X-2-10</pub-id>
<pub-id pub-id-type="pmid">23587291</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nagarajan</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Parametric complexity of sequence assembly: theory and applications to next generation sequencing</article-title>
<source>J Comput Biol</source>
<year>2009</year>
<volume>16</volume>
<issue>7</issue>
<fpage>897</fpage>
<lpage>908</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2009.0005</pub-id>
<pub-id pub-id-type="pmid">19580519</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Blazewicz</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bryja</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Figlerowicz</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gawron</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Kasprzak</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kirton</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Platt</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Przybytek</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Swiercz</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Szajkowski</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Whole genome assembly from 454 sequencing output via modified dna graph concept</article-title>
<source>Comput Biol Chem</source>
<year>2009</year>
<volume>33</volume>
<issue>3</issue>
<fpage>224</fpage>
<lpage>230</lpage>
<pub-id pub-id-type="doi">10.1016/j.compbiolchem.2009.04.005</pub-id>
<pub-id pub-id-type="pmid">19477687</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Compeau</surname>
<given-names>PEC</given-names>
</name>
<name>
<surname>Pevzner</surname>
<given-names>PA</given-names>
</name>
<name>
<surname>Tesler</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>How to apply de bruijn graphs to genome assembly</article-title>
<source>Nat Biotechnol</source>
<year>2011</year>
<volume>29</volume>
<issue>11</issue>
<fpage>987</fpage>
<lpage>991</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2023</pub-id>
<pub-id pub-id-type="pmid">22068540</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Birol</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Jackman</surname>
<given-names>SD</given-names>
</name>
<name>
<surname>Nielsen</surname>
<given-names>CB</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>JQ</given-names>
</name>
<name>
<surname>Varhol</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Stazyk</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Morin</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hirst</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Schein</surname>
<given-names>JE</given-names>
</name>
<name>
<surname>Horsman</surname>
<given-names>DE</given-names>
</name>
<name>
<surname>Connors</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Gascoyne</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Marra</surname>
<given-names>MA</given-names>
</name>
<name>
<surname>Jones</surname>
<given-names>SJ</given-names>
</name>
</person-group>
<article-title>De novo transcriptome assembly with abyss</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>21</issue>
<fpage>2872</fpage>
<lpage>2877</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp367</pub-id>
<pub-id pub-id-type="pmid">19528083</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zerbino</surname>
<given-names>DR</given-names>
</name>
<name>
<surname>Birney</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Velvet: algorithms for de novo short read assembly using de bruijn graphs</article-title>
<source>Genome Res</source>
<year>2008</year>
<volume>18</volume>
<issue>5</issue>
<fpage>821</fpage>
<lpage>829</lpage>
<pub-id pub-id-type="doi">10.1101/gr.074492.107</pub-id>
<pub-id pub-id-type="pmid">18349386</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>J</given-names>
</name>
<name>
<surname>He</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Cheung</surname>
<given-names>DW</given-names>
</name>
<name>
<surname>Yiu</surname>
<given-names>S-M</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Xiaoqian</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Liao</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lam</surname>
<given-names>T-W</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Soapdenovo2: an empirically improved memory-efficient short-read de novo assembler</article-title>
<source>Gigascience</source>
<year>2012</year>
<volume>1</volume>
<issue>1</issue>
<fpage>18</fpage>
<pub-id pub-id-type="doi">10.1186/2047-217X-1-18</pub-id>
<pub-id pub-id-type="pmid">23587118</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Miller</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Koren</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sutton</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Assembly algorithms for next-generation sequencing data</article-title>
<source>Genomics</source>
<year>2010</year>
<volume>95</volume>
<issue>6</issue>
<fpage>315</fpage>
<lpage>327</lpage>
<pub-id pub-id-type="doi">10.1016/j.ygeno.2010.03.001</pub-id>
<pub-id pub-id-type="pmid">20211242</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Almeida</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison-a review</article-title>
<source>Bioinformatics</source>
<year>2003</year>
<volume>19</volume>
<issue>4</issue>
<fpage>513</fpage>
<lpage>523</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btg005</pub-id>
<pub-id pub-id-type="pmid">12611807</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Polychronopoulos</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Weitschek</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Dimitrieva</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Bucher</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Felici</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Almirantis</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Classification of selectively constrained dna elements using feature vectors and rule-based classifiers</article-title>
<source>Genomics</source>
<year>2014</year>
<volume>104</volume>
<issue>2</issue>
<fpage>79</fpage>
<lpage>86</lpage>
<pub-id pub-id-type="doi">10.1016/j.ygeno.2014.07.004</pub-id>
<pub-id pub-id-type="pmid">25058025</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Vitnyi</surname>
<given-names>PMB</given-names>
</name>
</person-group>
<source>An Introduction to Kolmogorov Complexity and Its Applications</source>
<year>2008</year>
<publisher-loc>New York, NY, USA</publisher-loc>
<publisher-name>Springer</publisher-name>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Almeida</surname>
<given-names>JS</given-names>
</name>
<name>
<surname>Vinga</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Universal sequence map (usm) of arbitrary discrete sequences</article-title>
<source>BMC Bioinf</source>
<year>2002</year>
<volume>3</volume>
<fpage>6</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-3-6</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Giancarlo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Scaturro</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Utro</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Textual data compression in computational biology: a synopsis</article-title>
<source>Bioinformatics</source>
<year>2009</year>
<volume>25</volume>
<issue>13</issue>
<fpage>1575</fpage>
<lpage>1586</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btp117</pub-id>
<pub-id pub-id-type="pmid">19251772</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kuksa</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pavlovic</surname>
<given-names>V</given-names>
</name>
</person-group>
<article-title>Efficient alignment-free dna barcode analytics</article-title>
<source>BMC Bioinf</source>
<year>2009</year>
<volume>10</volume>
<issue>Suppl 14</issue>
<fpage>9</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-S14-S9</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hide</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Burke</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Da Vison</surname>
<given-names>DB</given-names>
</name>
</person-group>
<article-title>Biological evaluation of d2, an algorithm for high-performance sequence comparison</article-title>
<source>J Comput Biol</source>
<year>1994</year>
<volume>1</volume>
<issue>3</issue>
<fpage>199</fpage>
<lpage>215</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.1994.1.199</pub-id>
<pub-id pub-id-type="pmid">8790465</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teeling</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Meyerdiekers</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Glöckner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>Application of tetranucleotide frequencies for the assignment of genomic fragments</article-title>
<source>Environ Microbiol</source>
<year>2004</year>
<volume>6</volume>
<issue>9</issue>
<fpage>938</fpage>
<lpage>947</lpage>
<pub-id pub-id-type="doi">10.1111/j.1462-2920.2004.00624.x</pub-id>
<pub-id pub-id-type="pmid">15305919</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pride</surname>
<given-names>DT</given-names>
</name>
<name>
<surname>Meinersmann</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Wassenaar</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>Blaser</surname>
<given-names>MJ</given-names>
</name>
</person-group>
<article-title>Evolutionary implications of microbial genome tetranucleotide frequency biases</article-title>
<source>Genome Res</source>
<year>2003</year>
<volume>13</volume>
<fpage>145</fpage>
<lpage>158</lpage>
<pub-id pub-id-type="doi">10.1101/gr.335003</pub-id>
<pub-id pub-id-type="pmid">12566393</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Teeling</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Waldmann</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Lombardot</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Bauer</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Glöckner</surname>
<given-names>FO</given-names>
</name>
</person-group>
<article-title>Tetra: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences</article-title>
<source>BMC Bioinf</source>
<year>2004</year>
<volume>5</volume>
<fpage>163</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-5-163</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Trapnell</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Pop</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Ultrafast and memory-efficient alignment of short dna sequences to the human genome</article-title>
<source>Genome Biol</source>
<year>2009</year>
<volume>10</volume>
<issue>3</issue>
<fpage>25</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-3-r25</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Langmead</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Salzberg</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>Fast gapped-read alignment with bowtie 2</article-title>
<source>Nat Methods</source>
<year>2012</year>
<volume>9</volume>
<issue>4</issue>
<fpage>357</fpage>
<lpage>359</lpage>
<pub-id pub-id-type="doi">10.1038/nmeth.1923</pub-id>
<pub-id pub-id-type="pmid">22388286</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Needleman</surname>
<given-names>SB</given-names>
</name>
<name>
<surname>Wunsch</surname>
<given-names>CD</given-names>
</name>
</person-group>
<article-title>A general method applicable to the search for similarities in the amino acid sequence of two proteins</article-title>
<source>J Mol Biol</source>
<year>1970</year>
<volume>48</volume>
<issue>3</issue>
<fpage>443</fpage>
<lpage>453</lpage>
<pub-id pub-id-type="doi">10.1016/0022-2836(70)90057-4</pub-id>
<pub-id pub-id-type="pmid">5420325</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<mixed-citation publication-type="other">
<bold>NeoBio: Bioinformatics Algorithms in Java</bold>
[
<ext-link ext-link-type="uri" xlink:href="http://neobio.sourceforge.net/">http://neobio.sourceforge.net/</ext-link>
]</mixed-citation>
</ref>
<ref id="CR27">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Altschul</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gish</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Miller</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Myers</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Lipman</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Basic local alignment search tool</article-title>
<source>J Mol Biol</source>
<year>1990</year>
<volume>215</volume>
<fpage>403</fpage>
<lpage>410</lpage>
<pub-id pub-id-type="doi">10.1016/S0022-2836(05)80360-2</pub-id>
<pub-id pub-id-type="pmid">2231712</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28.</label>
<mixed-citation publication-type="other">
<bold>Blast Package Version 2.2.25–7</bold>
<ext-link ext-link-type="uri" xlink:href="http://packages.ubuntu.com/precise/ncbi-blast+">http://packages.ubuntu.com/precise/ncbi-blast+</ext-link>
</mixed-citation>
</ref>
<ref id="CR29">
<label>29.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fawcett</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>An introduction to roc analysis</article-title>
<source>Pattern Recognit Lett</source>
<year>2006</year>
<volume>27</volume>
<fpage>861</fpage>
<lpage>874</lpage>
<pub-id pub-id-type="doi">10.1016/j.patrec.2005.10.010</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30.</label>
<mixed-citation publication-type="other">
<bold>NCBI Sequence Read Archive</bold>
<ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/sra">http://www.ncbi.nlm.nih.gov/sra</ext-link>
</mixed-citation>
</ref>
<ref id="CR31">
<label>31.</label>
<mixed-citation publication-type="other">
<bold>E. Coli Reads Source</bold>
<ext-link ext-link-type="uri" xlink:href="http://petang.cgu.edu.tw/Bioinfomatics/Lecture/0_HTS/08/HTS_E08.pdf">http://petang.cgu.edu.tw/Bioinfomatics/Lecture/0_HTS/08/HTS_E08.pdf</ext-link>
</mixed-citation>
</ref>
<ref id="CR32">
<label>32.</label>
<mixed-citation publication-type="other">
<bold>Yeast Bowtie Index</bold>
<ext-link ext-link-type="uri" xlink:href="http://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/s_cerevisiae.ebwt.zip">http://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/s_cerevisiae.ebwt.zip</ext-link>
</mixed-citation>
</ref>
<ref id="CR33">
<label>33.</label>
<mixed-citation publication-type="other">
<bold>E. Coli Bowtie Index</bold>
<ext-link ext-link-type="uri" xlink:href="http://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/e_coli.ebwt.zip">http://ftp.ccb.jhu.edu/pub/data/bowtie_indexes/e_coli.ebwt.zip</ext-link>
</mixed-citation>
</ref>
<ref id="CR34">
<label>34.</label>
<mixed-citation publication-type="other">
<bold>Dazzler Assembler for PacBio Reads</bold>
<ext-link ext-link-type="uri" xlink:href="http://www.homolog.us/blogs/blog/2014/02/14/dazzle-assembler-pacbio-reads-gene-myers/">http://www.homolog.us/blogs/blog/2014/02/14/dazzle-assembler-pacbio-reads-gene-myers/</ext-link>
</mixed-citation>
</ref>
<ref id="CR35">
<label>35.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Alignment-free sequence comparison based on next generation sequencing reads</article-title>
<source>J Comput Biol</source>
<year>2013</year>
<volume>20</volume>
<issue>2</issue>
<fpage>64</fpage>
<lpage>79</lpage>
<pub-id pub-id-type="doi">10.1089/cmb.2012.0228</pub-id>
<pub-id pub-id-type="pmid">23383994</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000337  | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000337  | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021