MERS exploration server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information has therefore not been validated.

Internal identifier: 000951 (Pmc/Corpus); previous: 0009509; next: 0009520

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Probabilistic topic modeling for the analysis and classification of genomic sequences</title>
<author>
<name sortKey="La Rosa, Massimo" sort="La Rosa, Massimo" uniqKey="La Rosa M" first="Massimo" last="La Rosa">Massimo La Rosa</name>
<affiliation>
<nlm:aff id="I1">ICAR-CNR, National Research Council of Italy, via Pietro Castellino 111, 80131 Napoli, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fiannaca, Antonino" sort="Fiannaca, Antonino" uniqKey="Fiannaca A" first="Antonino" last="Fiannaca">Antonino Fiannaca</name>
<affiliation>
<nlm:aff id="I2">ICAR-CNR, National Research Council of Italy, viale delle Scienze Ed.11, 90128 Palermo, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rizzo, Riccardo" sort="Rizzo, Riccardo" uniqKey="Rizzo R" first="Riccardo" last="Rizzo">Riccardo Rizzo</name>
<affiliation>
<nlm:aff id="I2">ICAR-CNR, National Research Council of Italy, viale delle Scienze Ed.11, 90128 Palermo, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Urso, Alfonso" sort="Urso, Alfonso" uniqKey="Urso A" first="Alfonso" last="Urso">Alfonso Urso</name>
<affiliation>
<nlm:aff id="I2">ICAR-CNR, National Research Council of Italy, viale delle Scienze Ed.11, 90128 Palermo, Italy</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">25916734</idno>
<idno type="pmc">4416183</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4416183</idno>
<idno type="RBID">PMC:4416183</idno>
<idno type="doi">10.1186/1471-2105-16-S6-S2</idno>
<date when="2015">2015</date>
<idno type="wicri:Area/Pmc/Corpus">000951</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000951</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Probabilistic topic modeling for the analysis and classification of genomic sequences</title>
<author>
<name sortKey="La Rosa, Massimo" sort="La Rosa, Massimo" uniqKey="La Rosa M" first="Massimo" last="La Rosa">Massimo La Rosa</name>
<affiliation>
<nlm:aff id="I1">ICAR-CNR, National Research Council of Italy, via Pietro Castellino 111, 80131 Napoli, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fiannaca, Antonino" sort="Fiannaca, Antonino" uniqKey="Fiannaca A" first="Antonino" last="Fiannaca">Antonino Fiannaca</name>
<affiliation>
<nlm:aff id="I2">ICAR-CNR, National Research Council of Italy, viale delle Scienze Ed.11, 90128 Palermo, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Rizzo, Riccardo" sort="Rizzo, Riccardo" uniqKey="Rizzo R" first="Riccardo" last="Rizzo">Riccardo Rizzo</name>
<affiliation>
<nlm:aff id="I2">ICAR-CNR, National Research Council of Italy, viale delle Scienze Ed.11, 90128 Palermo, Italy</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Urso, Alfonso" sort="Urso, Alfonso" uniqKey="Urso A" first="Alfonso" last="Urso">Alfonso Urso</name>
<affiliation>
<nlm:aff id="I2">ICAR-CNR, National Research Council of Italy, viale delle Scienze Ed.11, 90128 Palermo, Italy</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2015">2015</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, which represent a well-defined region of the whole genome. Recently, alignment-free techniques have gained importance because they overcome the drawbacks of sequence alignment techniques. In this paper, a new alignment-free method for DNA sequence clustering and classification is proposed. The method is based on
<italic>k</italic>
-mer representation and text mining techniques.</p>
</sec>
<sec>
<title>Methods</title>
<p>The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models find, in a document corpus, the topics (recurrent themes) that characterize classes of documents. This technique, applied to DNA sequences treated as documents, exploits the frequency of fixed-length
<italic>k</italic>
-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences.</p>
</sec>
<sec>
<title>Results and conclusions</title>
<p>We trained probabilistic topic models and used them to classify over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository. The proposed method is compared to the RDP tool and the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). For complete sequences, our method achieves results very similar to those of the RDP classifier and SVM. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences, and it exhibits a smooth decrease of performance, at every taxonomic level, as the sequence length decreases.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Drancourt, M" uniqKey="Drancourt M">M Drancourt</name>
</author>
<author>
<name sortKey="Raoult, D" uniqKey="Raoult D">D Raoult</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gaston, Kj" uniqKey="Gaston K">KJ Gaston</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Drancourt, M" uniqKey="Drancourt M">M Drancourt</name>
</author>
<author>
<name sortKey="Berger, P" uniqKey="Berger P">P Berger</name>
</author>
<author>
<name sortKey="Raoult, D" uniqKey="Raoult D">D Raoult</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hebert, Pdn" uniqKey="Hebert P">PDN Hebert</name>
</author>
<author>
<name sortKey="Ratnasingham, S" uniqKey="Ratnasingham S">S Ratnasingham</name>
</author>
<author>
<name sortKey="Dewaard, Jr" uniqKey="Dewaard J">JR DeWaard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nei, M" uniqKey="Nei M">M Nei</name>
</author>
<author>
<name sortKey="Kumar, Md" uniqKey="Kumar M">MD Kumar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Di Fatta, G" uniqKey="Di Fatta G">G Di Fatta</name>
</author>
<author>
<name sortKey="Gaglio, S" uniqKey="Gaglio S">S Gaglio</name>
</author>
<author>
<name sortKey="Giammanco, G" uniqKey="Giammanco G">G Giammanco</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
<author>
<name sortKey="Gaglio, S" uniqKey="Gaglio S">S Gaglio</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Gaglio, S" uniqKey="Gaglio S">S Gaglio</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, M" uniqKey="Li M">M Li</name>
</author>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X Chen</name>
</author>
<author>
<name sortKey="Li, X" uniqKey="Li X">X Li</name>
</author>
<author>
<name sortKey="Ma, B" uniqKey="Ma B">B Ma</name>
</author>
<author>
<name sortKey="Vitanyi, Pmb" uniqKey="Vitanyi P">PMB Vitanyi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Fiannaca, A" uniqKey="Fiannaca A">A Fiannaca</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Fiannaca, A" uniqKey="Fiannaca A">A Fiannaca</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chor, B" uniqKey="Chor B">B Chor</name>
</author>
<author>
<name sortKey="Horn, D" uniqKey="Horn D">D Horn</name>
</author>
<author>
<name sortKey="Goldman, N" uniqKey="Goldman N">N Goldman</name>
</author>
<author>
<name sortKey="Levy, Y" uniqKey="Levy Y">Y Levy</name>
</author>
<author>
<name sortKey="Massingham, T" uniqKey="Massingham T">T Massingham</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Scholkopf, B" uniqKey="Scholkopf B">B Scholkopf</name>
</author>
<author>
<name sortKey="Smola, Aj" uniqKey="Smola A">AJ Smola</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuksa, P" uniqKey="Kuksa P">P Kuksa</name>
</author>
<author>
<name sortKey="Pavlovic, V" uniqKey="Pavlovic V">V Pavlovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kuksa, P" uniqKey="Kuksa P">P Kuksa</name>
</author>
<author>
<name sortKey="Pavlovic, V" uniqKey="Pavlovic V">V Pavlovic</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Martinetz, Tm" uniqKey="Martinetz T">TM Martinetz</name>
</author>
<author>
<name sortKey="Berkovich, Sg" uniqKey="Berkovich S">SG Berkovich</name>
</author>
<author>
<name sortKey="Schulten, Kj" uniqKey="Schulten K">KJ Schulten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fiannaca, A" uniqKey="Fiannaca A">A Fiannaca</name>
</author>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sandberg, R" uniqKey="Sandberg R">R Sandberg</name>
</author>
<author>
<name sortKey="Winberg, G" uniqKey="Winberg G">G Winberg</name>
</author>
<author>
<name sortKey="Br Nden, C I" uniqKey="Br Nden C">C.-i Bränden</name>
</author>
<author>
<name sortKey="Kaske, A" uniqKey="Kaske A">A Kaske</name>
</author>
<author>
<name sortKey="Ernberg, I" uniqKey="Ernberg I">I Ernberg</name>
</author>
<author>
<name sortKey="Coster, J" uniqKey="Coster J">J Cöster</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="Garrity, Gm" uniqKey="Garrity G">GM Garrity</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author>
<name sortKey="Cole, Jr" uniqKey="Cole J">JR Cole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Z" uniqKey="Liu Z">Z Liu</name>
</author>
<author>
<name sortKey="Desantis, Tz" uniqKey="Desantis T">TZ DeSantis</name>
</author>
<author>
<name sortKey="Andersen, Gl" uniqKey="Andersen G">GL Andersen</name>
</author>
<author>
<name sortKey="Knight, R" uniqKey="Knight R">R Knight</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Desantis, Tz" uniqKey="Desantis T">TZ DeSantis</name>
</author>
<author>
<name sortKey="Keller, K" uniqKey="Keller K">K Keller</name>
</author>
<author>
<name sortKey="Karaoz, U" uniqKey="Karaoz U">U Karaoz</name>
</author>
<author>
<name sortKey="Alekseyenko, Av" uniqKey="Alekseyenko A">AV Alekseyenko</name>
</author>
<author>
<name sortKey="Singh, Nns" uniqKey="Singh N">NNS Singh</name>
</author>
<author>
<name sortKey="Brodie, El" uniqKey="Brodie E">EL Brodie</name>
</author>
<author>
<name sortKey="Pei, Z" uniqKey="Pei Z">Z Pei</name>
</author>
<author>
<name sortKey="Andersen, Gl" uniqKey="Andersen G">GL Andersen</name>
</author>
<author>
<name sortKey="Larsen, N" uniqKey="Larsen N">N Larsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Domingos, P" uniqKey="Domingos P">P Domingos</name>
</author>
<author>
<name sortKey="Pazzani, M" uniqKey="Pazzani M">M Pazzani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Steyvers, M" uniqKey="Steyvers M">M Steyvers</name>
</author>
<author>
<name sortKey="Griffiths, T" uniqKey="Griffiths T">T Griffiths</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Perona, P" uniqKey="Perona P">P Perona</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bart, E" uniqKey="Bart E">E Bart</name>
</author>
<author>
<name sortKey="Welling, M" uniqKey="Welling M">M Welling</name>
</author>
<author>
<name sortKey="Perona, P" uniqKey="Perona P">P Perona</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blei, Dm" uniqKey="Blei D">DM Blei</name>
</author>
<author>
<name sortKey="Jordan, Mi" uniqKey="Jordan M">MI Jordan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hu, Dj" uniqKey="Hu D">DJ Hu</name>
</author>
<author>
<name sortKey="Saul, Lk" uniqKey="Saul L">LK Saul</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, S" uniqKey="Kim S">S Kim</name>
</author>
<author>
<name sortKey="Narayanan, S" uniqKey="Narayanan S">S Narayanan</name>
</author>
<author>
<name sortKey="Sundaram, S" uniqKey="Sundaram S">S Sundaram</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Falush, D" uniqKey="Falush D">D Falush</name>
</author>
<author>
<name sortKey="Stephens, M" uniqKey="Stephens M">M Stephens</name>
</author>
<author>
<name sortKey="Pritchard, Jk" uniqKey="Pritchard J">JK Pritchard</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pinoli, P" uniqKey="Pinoli P">P Pinoli</name>
</author>
<author>
<name sortKey="Chicco, D" uniqKey="Chicco D">D Chicco</name>
</author>
<author>
<name sortKey="Masseroli, M" uniqKey="Masseroli M">M Masseroli</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Masseroli, M" uniqKey="Masseroli M">M Masseroli</name>
</author>
<author>
<name sortKey="Chicco, D" uniqKey="Chicco D">D Chicco</name>
</author>
<author>
<name sortKey="Pinoli, P" uniqKey="Pinoli P">P Pinoli</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hofmann, T" uniqKey="Hofmann T">T Hofmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blei, Dm" uniqKey="Blei D">DM Blei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Griffiths, Tl" uniqKey="Griffiths T">TL Griffiths</name>
</author>
<author>
<name sortKey="Steyvers, M" uniqKey="Steyvers M">M Steyvers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Blei, Dm" uniqKey="Blei D">DM Blei</name>
</author>
<author>
<name sortKey="Ng, Ay" uniqKey="Ng A">AY Ng</name>
</author>
<author>
<name sortKey="Jordan, Mi" uniqKey="Jordan M">MI Jordan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, W" uniqKey="Li W">W Li</name>
</author>
<author>
<name sortKey="Mccallum, A" uniqKey="Mccallum A">A McCallum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Teh, Yw" uniqKey="Teh Y">YW Teh</name>
</author>
<author>
<name sortKey="Jordan, Mi" uniqKey="Jordan M">MI Jordan</name>
</author>
<author>
<name sortKey="Beal, Mj" uniqKey="Beal M">MJ Beal</name>
</author>
<author>
<name sortKey="Blei, Dm" uniqKey="Blei D">DM Blei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Grun, B" uniqKey="Grun B">B Grun</name>
</author>
<author>
<name sortKey="Hornik, K" uniqKey="Hornik K">K Hornik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Casella, G" uniqKey="Casella G">G Casella</name>
</author>
<author>
<name sortKey="George, Ei" uniqKey="George E">EI George</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cole, Jr" uniqKey="Cole J">JR Cole</name>
</author>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="Cardenas, E" uniqKey="Cardenas E">E Cardenas</name>
</author>
<author>
<name sortKey="Fish, J" uniqKey="Fish J">J Fish</name>
</author>
<author>
<name sortKey="Chai, B" uniqKey="Chai B">B Chai</name>
</author>
<author>
<name sortKey="Farris, Rj" uniqKey="Farris R">RJ Farris</name>
</author>
<author>
<name sortKey="Kulam Syed Mohideen, As" uniqKey="Kulam Syed Mohideen A">AS Kulam-Syed-Mohideen</name>
</author>
<author>
<name sortKey="Mcgarrell, Dm" uniqKey="Mcgarrell D">DM McGarrell</name>
</author>
<author>
<name sortKey="Marsh, T" uniqKey="Marsh T">T Marsh</name>
</author>
<author>
<name sortKey="Garrity, Gm" uniqKey="Garrity G">GM Garrity</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Geer, Ly" uniqKey="Geer L">LY Geer</name>
</author>
<author>
<name sortKey="Marchler Bauer, A" uniqKey="Marchler Bauer A">A Marchler-Bauer</name>
</author>
<author>
<name sortKey="Geer, Rc" uniqKey="Geer R">RC Geer</name>
</author>
<author>
<name sortKey="Han, L" uniqKey="Han L">L Han</name>
</author>
<author>
<name sortKey="He, J" uniqKey="He J">J He</name>
</author>
<author>
<name sortKey="He, S" uniqKey="He S">S He</name>
</author>
<author>
<name sortKey="Liu, C" uniqKey="Liu C">C Liu</name>
</author>
<author>
<name sortKey="Shi, W" uniqKey="Shi W">W Shi</name>
</author>
<author>
<name sortKey="Bryant, Sh" uniqKey="Bryant S">SH Bryant</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="La Rosa, M" uniqKey="La Rosa M">M La Rosa</name>
</author>
<author>
<name sortKey="Fiannaca, A" uniqKey="Fiannaca A">A Fiannaca</name>
</author>
<author>
<name sortKey="Rizzo, R" uniqKey="Rizzo R">R Rizzo</name>
</author>
<author>
<name sortKey="Urso, A" uniqKey="Urso A">A Urso</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wooley, Jc" uniqKey="Wooley J">JC Wooley</name>
</author>
<author>
<name sortKey="Godzik, A" uniqKey="Godzik A">A Godzik</name>
</author>
<author>
<name sortKey="Friedberg, I" uniqKey="Friedberg I">I Friedberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Karatzoglou, A" uniqKey="Karatzoglou A">A Karatzoglou</name>
</author>
<author>
<name sortKey="Meyer, D" uniqKey="Meyer D">D Meyer</name>
</author>
<author>
<name sortKey="Hornik, K" uniqKey="Hornik K">K Hornik</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chang, Cc" uniqKey="Chang C">CC Chang</name>
</author>
<author>
<name sortKey="Lin, Cj" uniqKey="Lin C">CJ Lin</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article" xml:lang="en">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">25916734</article-id>
<article-id pub-id-type="pmc">4416183</article-id>
<article-id pub-id-type="publisher-id">1471-2105-16-S6-S2</article-id>
<article-id pub-id-type="doi">10.1186/1471-2105-16-S6-S2</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Probabilistic topic modeling for the analysis and classification of genomic sequences</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes" id="A1">
<name>
<surname>La Rosa</surname>
<given-names>Massimo</given-names>
</name>
<xref ref-type="aff" rid="I1">1</xref>
<email>larosa@pa.icar.cnr.it</email>
</contrib>
<contrib contrib-type="author" id="A2">
<name>
<surname>Fiannaca</surname>
<given-names>Antonino</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
</contrib>
<contrib contrib-type="author" id="A3">
<name>
<surname>Rizzo</surname>
<given-names>Riccardo</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
</contrib>
<contrib contrib-type="author" id="A4">
<name>
<surname>Urso</surname>
<given-names>Alfonso</given-names>
</name>
<xref ref-type="aff" rid="I2">2</xref>
</contrib>
</contrib-group>
<aff id="I1">
<label>1</label>
ICAR-CNR, National Research Council of Italy, via Pietro Castellino 111, 80131 Napoli, Italy</aff>
<aff id="I2">
<label>2</label>
ICAR-CNR, National Research Council of Italy, viale delle Scienze Ed.11, 90128 Palermo, Italy</aff>
<pub-date pub-type="collection">
<year>2015</year>
</pub-date>
<pub-date pub-type="epub">
<day>17</day>
<month>4</month>
<year>2015</year>
</pub-date>
<volume>16</volume>
<issue>Suppl 6</issue>
<supplement>
<named-content content-type="supplement-title">Selected articles from the 10th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics</named-content>
<named-content content-type="supplement-editor">Enrico Formenti, Roberto Tagliaferri, Ernst Wit and Riccardo Rizzo</named-content>
<named-content content-type="supplement-sponsor">Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. Articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.</named-content>
</supplement>
<fpage>S2</fpage>
<lpage>S2</lpage>
<permissions>
<copyright-statement>Copyright © 2015 La Rosa et al.; licensee BioMed Central Ltd.</copyright-statement>
<copyright-year>2015</copyright-year>
<copyright-holder>La Rosa et al.; licensee BioMed Central Ltd.</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0">
<license-p>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0">http://creativecommons.org/licenses/by/4.0</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<self-uri xlink:href="http://www.biomedcentral.com/1471-2105/16/S6/S2"></self-uri>
<abstract>
<sec>
<title>Background</title>
<p>Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, which represent a well-defined region of the whole genome. Recently, alignment-free techniques have gained importance because they overcome the drawbacks of sequence alignment techniques. In this paper, a new alignment-free method for DNA sequence clustering and classification is proposed. The method is based on
<italic>k</italic>
-mer representation and text mining techniques.</p>
</sec>
<sec>
<title>Methods</title>
<p>The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models find, in a document corpus, the topics (recurrent themes) that characterize classes of documents. This technique, applied to DNA sequences treated as documents, exploits the frequency of fixed-length
<italic>k</italic>
-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences.</p>
</sec>
<sec>
<title>Results and conclusions</title>
<p>We trained probabilistic topic models and used them to classify over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository. The proposed method is compared to the RDP tool and the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). For complete sequences, our method achieves results very similar to those of the RDP classifier and SVM. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences, and it exhibits a smooth decrease of performance, at every taxonomic level, as the sequence length decreases.</p>
</sec>
</abstract>
<kwd-group>
<kwd>Probabilistic topic model</kwd>
<kwd>ultra short sequence classification</kwd>
<kwd>LDA</kwd>
</kwd-group>
<conference>
<conf-date>20-22 June 2013</conf-date>
<conf-name>10th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics</conf-name>
<conf-loc>Nice, France</conf-loc>
</conference>
</article-meta>
</front>
<body>
<sec>
<title>Background</title>
<p>The study of genomic sequences for classification and taxonomic purposes has a leading role both in microbial identification [
<xref ref-type="bibr" rid="B1">1</xref>
], with important consequences in the biomedical field, and in the classification of living species such as animals or plants, for studies about the biodiversity of different ecosystems [
<xref ref-type="bibr" rid="B2">2</xref>
]. These analyses focus only on a well-defined region of the genome, usually referred to as a barcode gene: for example, the 16S rRNA gene for bacteria [
<xref ref-type="bibr" rid="B3">3</xref>
], and the cytochrome c oxidase I (COI) for animals [
<xref ref-type="bibr" rid="B4">4</xref>
]. The first computational approaches to these data were based on sequence alignment and on sequence similarity, measured through evolutionary distances, with already identified genomic sequences [
<xref ref-type="bibr" rid="B5">5</xref>
]. More recently, novel machine learning and data mining methodologies have been developed. For example, clustering algorithms, which are unsupervised techniques able to find groups of similar objects, have been applied to identify the taxonomic rank of bacterial isolates. The aim of this approach was to find a correlation between clusters and collections of bacteria belonging to the same taxon (taxonomic category). Clustering techniques have been used with the similarity among gene sequences expressed both in terms of classic evolutionary models [
<xref ref-type="bibr" rid="B6">6</xref>
,
<xref ref-type="bibr" rid="B7">7</xref>
], and in terms of compression-based models [
<xref ref-type="bibr" rid="B8">8</xref>
,
<xref ref-type="bibr" rid="B9">9</xref>
], which derive their theoretical foundation from the information-theoretic concept of the Universal Similarity Metric [
<xref ref-type="bibr" rid="B10">10</xref>
]. Compression-based approaches have also been adopted for the study of phylogenetic relationships among animal species, using the COI barcode gene [
<xref ref-type="bibr" rid="B11">11</xref>
,
<xref ref-type="bibr" rid="B12">12</xref>
].</p>
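<p>As a rough illustration of the compression-based similarity idea mentioned above, the sketch below computes the normalized compression distance, a commonly used computable approximation of the Universal Similarity Metric. The choice of zlib as the compressor, the helper names and the toy sequences are assumptions made here for illustration; they are not taken from the cited works.</p>
<preformat>
```python
import random
import zlib


def compressed_size(data):
    """Length in bytes of the zlib-compressed input (stand-in for an ideal compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy data: a highly repetitive sequence and a pseudo-random one.
rng = random.Random(0)
seq_a = b"ACGT" * 50
seq_b = bytes(rng.choice(b"ACGT") for _ in range(200))

print("distance of seq_a to itself:", ncd(seq_a, seq_a))
print("distance of seq_a to seq_b:", ncd(seq_a, seq_b))
```
</preformat>
<p>Smaller values suggest that two sequences share more of their information. Because a real compressor only approximates the ideal one, the distance of a sequence to itself is small but generally not exactly zero.</p>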
<p>Recent alignment-free computational approaches treat a genomic sequence as a collection of
<italic>k </italic>
-mers. A
<italic>k </italic>
-mer is a short fragment of a DNA string, of length
<italic>k</italic>
. In the bioinformatics domain, the
<italic>k </italic>
-mer representation has been used in many works. For example, an in-depth analysis of
<italic>k </italic>
-mer spectra has been carried out in [
<xref ref-type="bibr" rid="B13">13</xref>
]; a vector representation of DNA sequence using
<italic>k </italic>
-mers has been adopted for classification tasks using Support Vector Machines (SVM) [
<xref ref-type="bibr" rid="B14">14</xref>
] in [
<xref ref-type="bibr" rid="B15">15</xref>
,
<xref ref-type="bibr" rid="B16">16</xref>
], and using the Neural Gas algorithm [
<xref ref-type="bibr" rid="B17">17</xref>
] in [
<xref ref-type="bibr" rid="B18">18</xref>
];
<italic>k </italic>
-mer occurrences in genomic sequences have been considered for training a Naive Bayesian classifier [
<xref ref-type="bibr" rid="B19">19</xref>
,
<xref ref-type="bibr" rid="B20">20</xref>
]. Two of the most accurate sequence classifiers that adopt a
<italic>k </italic>
-mer representation, as shown in [
<xref ref-type="bibr" rid="B21">21</xref>
], are the RDP classifier [
<xref ref-type="bibr" rid="B20">20</xref>
] and the Simrank algorithm [
<xref ref-type="bibr" rid="B22">22</xref>
]. The RDP tool trains a Naive Bayesian classifier [
<xref ref-type="bibr" rid="B23">23</xref>
] using as input the occurrence frequencies of
<italic>k </italic>
-mers in a 16S gene dataset; the fitted probabilistic model can then predict the taxonomic label of an unknown (unlabeled) sequence. The Simrank tool is a search algorithm that employs a
<italic>k </italic>
-mer representation to speed up sequence similarity searches between an unknown query sequence and a repository of tagged 16S genomic sequences.</p>
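<p>A minimal sketch of the <italic>k</italic>-mer representation described above: overlapping <italic>k</italic>-mers are counted with a sliding window and arranged into a fixed-length count vector, the kind of input that a classifier such as SVM or a Naive Bayesian model can consume. The function names, the toy sequence and the lexicographic ordering of the vocabulary are illustrative assumptions, not details of the cited tools.</p>
<preformat>
```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers (substrings of length k) in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k, alphabet="ACGT"):
    """Return k-mer counts as a vector over all len(alphabet)**k possible k-mers."""
    counts = kmer_counts(sequence, k)
    # Vocabulary in lexicographic order: AA, AC, AG, AT, CA, ... for k = 2.
    vocabulary = ["".join(letters) for letters in product(alphabet, repeat=k)]
    return [counts[word] for word in vocabulary]


toy = "ACGTACGA"  # hypothetical toy sequence, not real 16S data
print(kmer_counts(toy, 3))
print(kmer_vector(toy, 2))
```
</preformat>
<p>Viewed this way, each sequence becomes a bag of <italic>k</italic>-mer "words", so that a collection of sequences can be handled like a corpus of text documents.</p>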
<p>In this work we propose a new computational method for sequence classification based on
<italic>k </italic>
-mer representation and text mining techniques. If we consider DNA sequences as documents and their
<italic>k </italic>
-mers as words, it is possible to extract the most recurrent themes, or topics, shared by the corpus of sequences. Since text documents about a specific subject, such as economics or biology, share the same topics, our thesis is that sequences sharing the same most recurrent themes (topics) are strongly similar to one another and belong to the same taxonomic rank. For this reason, our approach is based on the probabilistic topic modeling methodology [
<xref ref-type="bibr" rid="B24">24</xref>
], usually adopted for the identification and classification of text documents. Probabilistic topic models are algorithms that, given a set of text documents called a corpus, extract a group of probability distributions over the words in the documents, i.e. the topics. Our aim is then to learn a probabilistic topic model using this representation, in order to extract the most probable topics from the DNA corpus. The extracted topics are then used to classify unknown test sequences. Apart from text documents, topic models have also been adopted for the analysis of image, audio and music data. In image processing, it is assumed that similar collections of images share the same visual patterns (representing the topics). In this way, topic modeling has been applied, for example, for image classification [
<xref ref-type="bibr" rid="B25">25</xref>
], for building image hierarchies [
<xref ref-type="bibr" rid="B26">26</xref>
] and for linking captions and images [
<xref ref-type="bibr" rid="B27">27</xref>
]. In order to infer musical key-profiles of classical music, music files have been considered as text documents, musical notes as words and musical key-profiles as topics [
<xref ref-type="bibr" rid="B28">28</xref>
]. Topic modeling has also been used for audio information retrieval, as in [
<xref ref-type="bibr" rid="B29">29</xref>
]: the authors adopted Latent Dirichlet Allocation (LDA) as the topic model and used one of the parameters of the fitted model (namely the posterior Dirichlet parameter) as a feature vector in order to perform classification by means of the SVM algorithm. In bioinformatics, topic models have been applied to genomic data by [
<xref ref-type="bibr" rid="B30">30</xref>
], in order to find the topics, representing genetic signatures, of a population with a shared ancestor. Moreover, the authors of [
<xref ref-type="bibr" rid="B31">31</xref>
,
<xref ref-type="bibr" rid="B32">32</xref>
] applied the probabilistic Latent Semantic Analysis (pLSA) topic model [
<xref ref-type="bibr" rid="B33">33</xref>
] in order to predict annotations of Gene Ontology (GO) terms using only the previously available GO annotations.</p>
<p>We carried out experiments on a rich bacterial dataset, more than 7000 sequences, also including ultra-short sequences (length
<italic></italic>
50 bp), in order to assess the robustness of the proposed approach with respect to sequence length. Classification results were compared with those provided by the RDP classifier and the SVM classifier.</p>
<p>The rest of the paper is structured as follows: the Methods section reports the computational tools used in the paper, with a focus on the probabilistic topic model adopted and our document paradigm for DNA sequences; the Results and discussion section presents the datasets used and the classification results; finally the conclusions are drawn.</p>
</sec>
<sec sec-type="methods">
<title>Methods</title>
<p>In this Section we present our computational approaches to the analysis and classification of genomic sequences. After a brief description of probabilistic topic models, we formalize our document paradigm for gene sequences, then we explain our experimental pipelines both for the training and the testing phases.</p>
<sec>
<title>Probabilistic topic models</title>
<p>Probabilistic topic models are machine learning techniques adopted in the text mining field in order to mine semantic information from a set of documents, called a corpus [
<xref ref-type="bibr" rid="B34">34</xref>
,
<xref ref-type="bibr" rid="B35">35</xref>
,
<xref ref-type="bibr" rid="B24">24</xref>
]. Given a document corpus, probabilistic topic models are able to find a group of recurring themes, called
<italic>topics</italic>
, that are typical of certain classes of documents. For example, financial papers and scientific papers will exhibit different topics according to their specific subjects. Topics are actually probability distributions over the words of the documents. Imagine that we have a fixed vocabulary that is used to generate our document corpus. A topic is a distribution over this vocabulary: for example, the economy topic gives high probability to words about money and trade, and the biology topic to words about life and cells. Assume that these topics are defined before any document is generated: in order to write a document about the market impact of a new biological discovery, we will use words from both the biology and the economy topics. The goal of topic modeling is to automatically discover the topics from a collection of documents. More formally, if we assume that all the topics are defined before the documents are created, then each word of a document belonging to a collection can be generated in two steps. First, a topic is randomly selected according to the probability distribution over topics for the kind of document we want to generate; second, a word is randomly chosen according to the distribution over the vocabulary for that topic. If we assume that a document
<italic>d </italic>
is a sequence of
<italic>Q </italic>
words,
<italic>d </italic>
= (
<italic>w</italic>
<sub>1</sub>
<italic>, w</italic>
<sub>2</sub>
<italic>, ..., w
<sub>Q</sub>
</italic>
), the generative model for documents can be expressed by means of the following probability distribution:</p>
<p>
<disp-formula id="bmcM1">
<label>(1)</label>
<mml:math id="M1" name="1471-2105-16-S6-S2-i1" overflow="scroll">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:mi>z</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>P </italic>
(
<italic>w
<sub>i</sub>
</italic>
) is the probability of the word
<italic>w
<sub>i </sub>
</italic>
in a given document;
<italic>P </italic>
(
<italic>z </italic>
=
<italic>z
<sub>j </sub>
</italic>
) is the probability of choosing a word from topic
<italic>z
<sub>j </sub>
</italic>
for the current document;
<italic>P </italic>
(
<italic>w
<sub>i</sub>
|z </italic>
=
<italic>z
<sub>j </sub>
</italic>
) is the probability of sampling the word
<italic>w
<sub>i</sub>
</italic>
, given the topic
<italic>z
<sub>j </sub>
</italic>
;
<italic>T </italic>
is the number of topics. Given the words in a corpus of documents, which are the observable variables, a probabilistic topic model is learned by estimating the topic distribution of each document and the word distribution of each topic, which are the hidden variables. The number
<italic>T </italic>
of topics is a model parameter and has to be fixed a priori. Several algorithms can be used to infer a probabilistic topic model. One of the earliest topic models is the Probabilistic Latent Semantic Analysis (pLSA) algorithm [
<xref ref-type="bibr" rid="B33">33</xref>
]. In pLSA, each document is represented as a set of mixing proportions among the topics, but no generative probabilistic model is defined for these proportions [
<xref ref-type="bibr" rid="B36">36</xref>
]. This means that it is not possible to assign a topic distribution to documents that do not belong to the training set. For this reason, in our work we selected Latent Dirichlet Allocation (LDA) [
<xref ref-type="bibr" rid="B36">36</xref>
] as the probabilistic topic model. LDA is one of the simplest algorithms for inferring the topic distributions from the generative document model defined in Eq. 1 and, unlike pLSA, it provides a fitted model that is able to assign a topic distribution to test documents (i.e. documents not belonging to the corpus used to train the model) by computing their posterior probability, defined as the conditional distribution of topics given the words in the document. The generative model introduced by LDA is defined as follows.
<italic>P </italic>
(
<italic>w|z</italic>
) is represented as a set of
<italic>T </italic>
multinomial distributions
<italic>φ </italic>
over all the
<italic>W </italic>
unique words of the joint set of documents:
<italic>P </italic>
(
<italic>w|z </italic>
=
<italic>z
<sub>j </sub>
</italic>
) =
<italic>φ</italic>
(
<italic>j</italic>
).
<italic>P </italic>
(
<italic>z</italic>
) is represented as a set of
<italic>D</italic>
multinomial distributions
<italic>θ </italic>
over the
<italic>T </italic>
topics, where
<italic>D </italic>
is the number of documents
<italic>d </italic>
in the corpus:
<italic>P </italic>
(
<italic>z </italic>
=
<italic>z
<sub>j </sub>
</italic>
) =
<italic>θ</italic>
(
<italic>d</italic>
). Documents are then generated by first selecting a distribution over topics
<italic>θ </italic>
from a Dirichlet distribution. The words in the document are generated by selecting a topic
<italic>z
<sub>j </sub>
</italic>
from this distribution and then by selecting a word from this topic, using the distribution
<italic>P </italic>
(
<italic>w|z </italic>
=
<italic>z
<sub>j </sub>
</italic>
) that is determined from another Dirichlet distribution. More formally, LDA's generative model can be summarized in the following steps:</p>
<p>1 The word distribution
<italic>φ </italic>
for each topic, representing the probability of a word occurring in a given topic, is set as</p>
<p>
<disp-formula id="bmcM2">
<label>(2)</label>
<mml:math id="M2" name="1471-2105-16-S6-S2-i2" overflow="scroll">
<mml:mrow>
<mml:mi>φ</mml:mi>
<mml:mo class="MathClass-rel">∼</mml:mo>
<mml:mtext>Dirichlet</mml:mtext>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>δ</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>∼ </italic>
means "is distributed as".</p>
<p>2 The proportions
<italic>θ </italic>
of the topic distribution for the document
<italic>d </italic>
are set as</p>
<p>
<disp-formula id="bmcM3">
<label>(3)</label>
<mml:math id="M3" name="1471-2105-16-S6-S2-i3" overflow="scroll">
<mml:mrow>
<mml:mi>θ</mml:mi>
<mml:mo class="MathClass-rel">∼</mml:mo>
<mml:mtext>Dirichlet</mml:mtext>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:mi>α</mml:mi>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>3 For every word
<italic>w
<sub>i</sub>
</italic>
</p>
<p>(a) Select a topic
<italic>z
<sub>j </sub>
</italic>
∼ Multinomial(
<italic>θ</italic>
).</p>
<p>(b) Select a word
<italic>w
<sub>i </sub>
</italic>
from a multinomial probability distribution given the topic
<italic>z
<sub>j </sub>
</italic>
:
<italic>p</italic>
(
<italic>w
<sub>i</sub>
|z
<sub>j </sub>
, φ</italic>
).</p>
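<p>The three generative steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the toy vocabulary, the symmetric hyperparameter values (0.5 for both δ and α) and the helper names <italic>sample_dirichlet</italic> and <italic>generate_document</italic> are hypothetical, and the Dirichlet draws are obtained by normalising Gamma variates from the standard library.</p>
<preformat>
```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [value / total for value in draws]


def generate_document(phi, alpha, num_words, vocabulary, rng):
    """Generate one document given the per-topic word distributions phi."""
    num_topics = len(phi)
    # Step 2: topic proportions of this document, theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, num_topics, rng)
    words = []
    for _ in range(num_words):
        # Step 3(a): select a topic index j ~ Multinomial(theta).
        j = rng.choices(range(num_topics), weights=theta, k=1)[0]
        # Step 3(b): select a word from the word distribution of topic j.
        words.append(rng.choices(vocabulary, weights=phi[j], k=1)[0])
    return theta, words


rng = random.Random(0)
vocabulary = ["AAAA", "ACGT", "GGCC", "TTAA"]  # hypothetical toy 4-mer vocabulary
num_topics = 2
# Step 1: one word distribution per topic, phi ~ Dirichlet(delta), delta = 0.5.
phi = [sample_dirichlet(0.5, len(vocabulary), rng) for _ in range(num_topics)]
theta, words = generate_document(phi, alpha=0.5, num_words=10,
                                 vocabulary=vocabulary, rng=rng)
print(theta, words)
```
</preformat>
<p>Fitting an LDA model amounts to inverting this process: only the words are observed, and the distributions φ and θ have to be estimated.</p>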
<p>More complex topic models, like the Pachinko Allocation Model (PAM) [
<xref ref-type="bibr" rid="B37">37</xref>
] and Hierarchical Dirichlet Processes (HDP) [
<xref ref-type="bibr" rid="B38">38</xref>
], were not taken into account. PAM is able to find correlations between topics. In our work, however, we are not interested in inter-topic correlation, because we assume that topics, which are related to taxonomic ranks in our framework, are independent of each other. HDP extends LDA by also estimating the number of topics. In this work, as explained in the Results and discussion section, we are also interested in how classification results vary with the number of topics. For this reason we prefer the LDA model, which allows us to select the number of topics a priori for our experiments.</p>
</sec>
<sec>
<title>Document paradigm for gene sequences</title>
<p>In this work, probabilistic topic models have been adopted for the study of genomic sequences. Since topic models were developed for text mining, we set up a parallel between text documents and gene sequences. In our framework, a single DNA sequence represents a document; only the nucleotide sequence is considered, without any header such as the one used in the FASTA format. A dataset of sequences can then be considered as the corpus of documents. Unlike a text document, however, a DNA sequence is composed of a single text string, defined on a fixed alphabet (A, C, G, T). Words can be extracted from gene sequences following the so-called
<italic>k </italic>
-mer decomposition. As shown in Figure
<xref ref-type="fig" rid="F1">1</xref>
, for each sequence in the corpus, all the overlapping
<italic>k </italic>
-mers can be extracted with a sliding window of fixed length
<italic>k </italic>
(with
<italic>k </italic>
= 8 in the figure). The position of a
<italic>k </italic>
-mer in the original sequence is not taken into account, according to the bag-of-words model used in text analysis. The collection of all the extracted
<italic>k </italic>
-mers represents, for each sequence, the set of words.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption>
<p>
<bold>k-mer decomposition</bold>
. By means of a sliding window of fixed size
<italic>k, k </italic>
= 8 in this case, it is possible to extract all the overlapping k-mers, representing the words, from a gene sequence.</p>
</caption>
<graphic xlink:href="1471-2105-16-S6-S2-1"></graphic>
</fig>
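<p>As a minimal sketch of the decomposition described above, the following Python fragment extracts all the overlapping k-mers of a sequence with a sliding window and counts them as a bag of words. The function names and the toy sequence are hypothetical and serve only as an illustration.</p>
<preformat>
```python
from collections import Counter


def kmer_words(sequence, k=8):
    """Return all overlapping k-mers of a sequence, read with a sliding window."""
    sequence = sequence.upper()
    # A sequence shorter than k yields an empty list of words.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_words(sequence, k=8):
    """Count k-mer occurrences; positions are discarded (bag-of-words model)."""
    return Counter(kmer_words(sequence, k))


example = "ACGTACGTAC"  # hypothetical toy sequence
print(kmer_words(example, k=8))
print(bag_of_words(example, k=4))
```
</preformat>
<p>A sequence of length L yields L - k + 1 words, so the number of words grows with the sequence length while the vocabulary size is bounded by the number of possible k-mers over the four-letter alphabet.</p>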
</sec>
<sec>
<title>Finding topics in gene sequences</title>
<p>Using the document paradigm described in the "Document paradigm for gene sequences" section, we applied probabilistic topic models to a corpus of nucleotide sequences in order to extract the topics by means of the LDA algorithm. We aim to demonstrate that similar sequences share the same group of most probable topics, so that, if a taxonomic label can be assigned to those topics, the sequences can be classified according to their topic distributions. Moreover, using a fitted model, we can also predict the taxonomic rank of an unknown sequence by considering the label of its most probable topic. The methods adopted to assign a label to the topics and to find the most probable topic of a test sequence are described in the following sections.</p>
</sec>
<sec>
<title>Training workflow</title>
<p>Our proposed procedure for training probabilistic topic models on a genomic dataset is shown in the workflow of Figure
<xref ref-type="fig" rid="F2">2</xref>
, where round rectangles are processing steps and parallelograms stand for input or output data and models. All the sequences of the DNA corpus are first decomposed with the
<italic>k </italic>
-mer representation in order to extract their words. Then the LDA algorithm is used in order to infer the probabilistic topic model for a fixed number of topics
<italic>T</italic>
. In this work we used the LDA implementation provided by the R package
<italic>topicmodels </italic>
[
<xref ref-type="bibr" rid="B39">39</xref>
]. The LDA model, defined as in the "Probabilistic topic models" section, is fitted using Gibbs sampling [
<xref ref-type="bibr" rid="B40">40</xref>
] with the parameter values suggested in [
<xref ref-type="bibr" rid="B35">35</xref>
]. Given a fitted topic model, it is possible to obtain the posterior topic distribution of each sequence in the corpus. The topic assigned to each sequence
<italic>d
<sub>i </sub>
</italic>
in the training set is defined as:</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption>
<p>
<bold>Training workflow</bold>
. The words are extracted from the sequences of the input DNA dataset, retrieved from the Ribosomal Database Project (RDP) online repository, through k-mer decomposition; a probabilistic topic model is then learned using the Latent Dirichlet Allocation (LDA) algorithm. The model provides the topic distribution of the input dataset, and the most probable topics are labeled with a taxonomic rank using a majority voting scheme.</p>
</caption>
<graphic xlink:href="1471-2105-16-S6-S2-2"></graphic>
</fig>
<p>
<disp-formula id="bmcM4">
<label>(4)</label>
<mml:math id="M4" name="1471-2105-16-S6-S2-i4" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mtext>arg</mml:mtext>
<mml:munder class="msub">
<mml:mrow>
<mml:mtext>max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mtext>with</mml:mtext>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mi>j</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo class="MathClass-punc">;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<italic>T </italic>
is the number of topics,
<italic>N </italic>
is the number of sequences in the training set and
<italic>P </italic>
(
<italic>z|d</italic>
) is the topic distribution of a document.</p>
<p>In order to assign a taxonomic label to each topic, we adopted a majority voting scheme: each topic is given the most common taxonomic label among the sequences that exhibit that topic with the highest probability. For each topic
<italic>z
<sub>j </sub>
</italic>
and considering only the documents
<italic>d
<sub>i </sub>
</italic>
assigned to that topic according to Eq. 4, the taxonomic label
<inline-formula>
<mml:math id="M5" name="1471-2105-16-S6-S2-i5" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
of topic
<italic>z
<sub>j </sub>
</italic>
is defined as:</p>
<p>
<disp-formula id="bmcM5">
<label>(5)</label>
<mml:math id="M6" name="1471-2105-16-S6-S2-i6" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">:</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mtext>arg</mml:mtext>
<mml:munder class="msub">
<mml:mrow>
<mml:mtext>max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mstyle displaystyle="true">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:munderover>
</mml:mstyle>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-punc">;</mml:mo>
<mml:mtext>with</mml:mtext>
<mml:mspace class="thinspace" width="0.3em"></mml:mspace>
<mml:mi>i</mml:mi>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo class="MathClass-punc">;</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>
<disp-formula id="bmcM6">
<label>(6)</label>
<mml:math id="M7" name="1471-2105-16-S6-S2-i7" overflow="scroll">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo class="MathClass-open">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-punc">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo class="MathClass-close">)</mml:mo>
</mml:mrow>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfenced open="{">
<mml:mrow>
<mml:mtable class="array" columnlines="none" equalcolumns="false" equalrows="false">
<mml:mtr>
<mml:mtd class="array" columnalign="left">
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd class="array" columnalign="left">
<mml:mtext>if</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd class="array" columnalign="left">
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd class="array" columnalign="left">
<mml:mtext>if</mml:mtext>
<mml:mspace class="tmspace" width="2.77695pt"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo class="MathClass-rel">≠</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where
<inline-formula>
<mml:math id="M8" name="1471-2105-16-S6-S2-i8" overflow="scroll">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>l</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
is the label of sequence
<italic>d
<sub>i</sub>
</italic>
;
<italic>R </italic>
is the number of sequences belonging to topic
<italic>z
<sub>j </sub>
</italic>
;
<italic>f </italic>
(
<italic>d
<sub>i</sub>
, d
<sub>k </sub>
</italic>
) is a function that is equal to 1 if the labels of sequences
<italic>d
<sub>i </sub>
</italic>
and
<italic>d
<sub>k </sub>
</italic>
are the same, and 0 otherwise.</p>
<p>At the end of the training phase, we then obtain a fitted probabilistic topic model and a set of topics representing the taxonomic ranks of the input DNA corpus.</p>
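<p>The topic assignment of Eq. 4 and the labeling scheme of Eqs. 5 and 6 can be sketched as follows. This is a simplified illustration and not the code used for the experiments: the posterior topic distributions are assumed to be already available as plain lists of probabilities, the function names are hypothetical, the toy labels are chosen only as an example, and the vote is computed by direct counting, which selects the same most frequent label as the pairwise comparison of Eq. 5 (ties are broken arbitrarily).</p>
<preformat>
```python
from collections import Counter, defaultdict


def assign_topics(topic_distributions):
    """Eq. 4: index of the most probable topic for each document."""
    return [max(range(len(dist)), key=dist.__getitem__)
            for dist in topic_distributions]


def label_topics(assigned_topics, labels):
    """Eqs. 5-6: give each topic the most frequent label among its documents."""
    votes = defaultdict(Counter)
    for topic, label in zip(assigned_topics, labels):
        votes[topic][label] += 1
    return {topic: counts.most_common(1)[0][0]
            for topic, counts in votes.items()}


# Hypothetical posterior topic distributions P(z|d) for four training sequences.
posteriors = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]
labels = ["Bacilli", "Bacilli", "Clostridia", "Clostridia"]  # toy labels

assigned = assign_topics(posteriors)
topic_labels = label_topics(assigned, labels)
print(assigned, topic_labels)
```
</preformat>
<p>Topics to which no training sequence is assigned receive no label in this sketch; how such topics should be handled is left open here.</p>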
</sec>
<sec>
<title>Testing workflow</title>
<p>The testing procedure of our proposed method works as described in Figure
<xref ref-type="fig" rid="F3">3</xref>
. Test sequences are first decomposed into their
<italic>k </italic>
-mers, then the fitted topic model trained during the learning phase (Figure
<xref ref-type="fig" rid="F2">2</xref>
) is used to compute the topic distributions of the test sequences. Afterwards each sequence is assigned to its most probable topic, according to Eq. 4 and considering only the
<italic>M </italic>
sequences in the test set.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption>
<p>
<bold>Testing workflow</bold>
. The words are extracted from the test sequences through k-mer decomposition; then, by means of the fitted topic model learned during the training phase, the topic distributions of the test sequences are computed. Finally, each sequence is assigned to its most probable topic and, since topics were labeled during the training phase, the predicted rank of the test sequences is obtained.</p>
</caption>
<graphic xlink:href="1471-2105-16-S6-S2-3"></graphic>
</fig>
<p>Since, as described in the "Training workflow" section, each topic was labeled with a taxonomic rank during the training procedure, at the end of the testing phase we obtain the predicted taxonomic assignment of the test sequences. The prediction performance of our proposed approach can then be measured using the precision score, defined as:</p>
<p>
<disp-formula id="bmcM7">
<label>(7)</label>
<mml:math id="M9" name="1471-2105-16-S6-S2-i9" overflow="scroll">
<mml:mrow>
<mml:mtext>precision</mml:mtext>
<mml:mo class="MathClass-rel">=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mtext>true positives</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mtext>true positives</mml:mtext>
<mml:mo class="MathClass-bin">+</mml:mo>
<mml:mtext>false positives</mml:mtext>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
</p>
<p>where true positives (TP) are correctly classified test sequences, that is, sequences whose true label matches the label of their assigned topic, whereas false positives (FP) are misclassified test sequences.</p>
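<p>A minimal sketch of the prediction step and of the precision score of Eq. 7 is given below. It assumes that the topic distributions of the test sequences have already been computed and that a hypothetical dictionary maps each labeled topic to its taxonomic label; the function names and the toy data are illustrative only. A sequence whose most probable topic has no label is counted here as misclassified, which is an assumption of this sketch.</p>
<preformat>
```python
def predict_labels(test_topic_distributions, topic_labels):
    """Assign each test sequence the label of its most probable topic."""
    predictions = []
    for dist in test_topic_distributions:
        best_topic = max(range(len(dist)), key=dist.__getitem__)
        # An unlabeled topic gives None, which never matches a true label.
        predictions.append(topic_labels.get(best_topic))
    return predictions


def precision(predicted, expected):
    """Eq. 7: true positives / (true positives + false positives)."""
    true_positives = sum(1 for p, e in zip(predicted, expected) if p == e)
    false_positives = len(predicted) - true_positives
    return true_positives / (true_positives + false_positives)


topic_labels = {0: "Bacilli", 1: "Clostridia"}  # hypothetical labeled topics
test_posteriors = [[0.8, 0.2], [0.1, 0.9], [0.55, 0.45]]
true_labels = ["Bacilli", "Clostridia", "Clostridia"]

predicted = predict_labels(test_posteriors, topic_labels)
score = precision(predicted, true_labels)
print(predicted, score)
```
</preformat>
<p>In this toy case two of the three test sequences are classified correctly. The function is meant to be called on a non-empty test set.</p>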
</sec>
</sec>
<sec>
<title>Results and discussion</title>
<p>In this section we present the 16S bacteria dataset used and describe both the experimental settings and the results obtained using the probabilistic topic modeling approach for sequence classification. Our results are compared with those of two other algorithms used for sequence classification: the RDP classifier and the support vector machine (SVM) classifier.</p>
<sec>
<title>Datasets used</title>
<p>We evaluated our approach to gene sequence classification on bacterial species. For classification and taxonomic studies of bacteria, only a limited part of the genome is usually considered, about 1200-1400 bp, namely the housekeeping 16S rRNA gene [
<xref ref-type="bibr" rid="B3">3</xref>
]. In our study we assembled a 16S dataset by downloading the gene sequences from the Ribosomal Database Project (RDP) repository [
<xref ref-type="bibr" rid="B41">41</xref>
], release 10.32. We chose the four richest phyla, Actinobacteria, Bacteroidetes, Firmicutes, Proteobacteria, and, in order to retain a good quality dataset, we selected the 16S sequences that satisfy the following constraints:</p>
<p>1 type strain, representing reference specimen;</p>
<p>2 size
<italic>≥ </italic>
1200 bp, thus considering full gene sequences;</p>
<p>3 good quality, according to the quality parameters provided by the RDP repository;</p>
<p>4 NCBI taxonomy, i.e. sequences are labeled with the NCBI taxonomic nomenclature [
<xref ref-type="bibr" rid="B42">42</xref>
].</p>
<p>Moreover, we left out unclassified sequences and taxonomic ranks with fewer than ten sequences, in order to obtain a well-balanced dataset. Using these criteria, we set up a 16S dataset consisting of 7856 sequences, whose main features are summarized in Table
<xref ref-type="table" rid="T1">1</xref>
.</p>
<table-wrap id="T1" position="float">
<label>Table 1</label>
<caption>
<p>Main features of the 16S bacteria Dataset.</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="right">
<bold>phylum</bold>
</td>
<td align="right">
<bold># sequence</bold>
</td>
<td align="right">
<bold># class</bold>
</td>
<td align="right">
<bold># order</bold>
</td>
<td align="right">
<bold># family</bold>
</td>
</tr>
<tr>
<td align="right">Actinobacteria</td>
<td align="right">2165</td>
<td align="right">1</td>
<td align="right">3</td>
<td align="right">26</td>
</tr>
<tr>
<td align="right">Bacteroidetes</td>
<td align="right">760</td>
<td align="right">3</td>
<td align="right">3</td>
<td align="right">11</td>
</tr>
<tr>
<td align="right">Firmicutes</td>
<td align="right">1758</td>
<td align="right">4</td>
<td align="right">7</td>
<td align="right">29</td>
</tr>
<tr>
<td align="right">Proteobacteria</td>
<td align="right">3173</td>
<td align="right">5</td>
<td align="right">29</td>
<td align="right">66</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>Experimental setup</title>
<p>The experiments proposed in this paper, aimed at validating the probabilistic topic modeling approach, represent an expansion and an in-depth analysis of our previous work [
<xref ref-type="bibr" rid="B43">43</xref>
]. There, with a smaller dataset of 3000 sequences, we carried out a series of trials, using a tenfold cross-validation procedure, in order to test how the classification results varied with regard to the number of topics and the dataset composition. We obtained, with
<italic>k </italic>
-mer size = 8, global results ranging from a precision score of 99% at the phylum level to 80% at the family level. In all cases, we noticed that the best scores were reached only when the number of topics fixed a priori was at least equal to the number of different categories in the input dataset. For example, if we want to classify our dataset at the order level, we have to train a topic model with a number of topics equal to or greater than the number of orders. Of course, only in an ideal situation does the number of topics match the number of categories exactly; in fact, in our previous study we obtained better results with a larger number of topics, about twice the number of categories, corresponding to a situation in which each class covers, on average, two most probable topics. In this work, we enriched that experimental pipeline first of all by taking into account a bigger dataset consisting of 7856 gene sequences, described in the "Datasets used" section. Moreover, in order to tune the choice of the number of topics, the probabilistic topic models were trained in a hierarchical way: we fitted a different topic model at each taxonomic level, for each of the four phyla. Considering the Firmicutes phylum, for instance, in order to classify at the class level we trained a model on a training set composed of all the Firmicutes sequences. In order to classify at the order level, we trained a different topic model for each of the four classes of the Firmicutes phylum (see Table 1 for the number of categories in our bacteria dataset), and so on. As a general rule, we considered for each topic model a number of topics equal to one and two times the number of lower-level categories: if one class has four orders, for that class we trained topic models with four and eight topics. Once again, all the tests were carried out by means of a tenfold cross-validation procedure.</p>
<p>Unlike our previous work, in this paper we also evaluate the robustness and the generalization ability of our approach with respect to sequence length. For this reason, we also tested our method with short sequences, considering sequence fragments of 400, 200, 100, 50, 40 and 25 bp. In this case we submit to the testing workflow a fragment of length
<italic>f </italic>
(with
<italic>f </italic>
= 25, 40, 50 and so on) randomly extracted from the full-length sequence, and we consider the output classification. A robust classifier able to correctly predict the taxonomic rank of small DNA fragments is of fundamental importance in metagenomics applications, where genetic sequences are mainly extracted from environmental species and in many cases only ultra-short sequences, with size
<italic></italic>
50 bp, are available [
<xref ref-type="bibr" rid="B44">44</xref>
].</p>
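<p>The fragment extraction step can be sketched as follows. This is a minimal Python example, assuming that the start position is drawn uniformly at random and that a sequence shorter than the requested length is returned unchanged; neither detail is specified in the text, and the function name is hypothetical.</p>

```python
import random


def random_fragment(sequence, f, rng=random):
    """Return a contiguous fragment of length f from a random position.

    Assumption: the start index is uniform over all valid positions.
    Assumption: if the sequence is not longer than f, it is returned whole.
    """
    if f >= len(sequence):
        return sequence
    start = rng.randrange(len(sequence) - f + 1)
    return sequence[start:start + f]


# Usage with a seeded generator so that the draw is reproducible.
rng = random.Random(0)
toy_sequence = "ACGT" * 100  # placeholder sequence, not real 16S data
fragment = random_fragment(toy_sequence, 25, rng)
print(len(fragment))  # 25
```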
<p>Classification results, in terms of precision scores (Eq. 7), were compared with those of two other sequence classifiers: the RDP classifier [
<xref ref-type="bibr" rid="B20">20</xref>
], and the SVM classifier. The former consists of a naive Bayesian classifier trained on a
<italic>k </italic>
-mer representation of the sequences; the latter works on a vector representation of the gene sequences obtained by counting the occurrences of each
<italic>k </italic>
-mer. We adopted the SVM implementation provided by the R package
<italic>e1071 </italic>
[
<xref ref-type="bibr" rid="B45">45</xref>
], which provides a simple interface to the well-known LIBSVM library [
<xref ref-type="bibr" rid="B46">46</xref>
]. The SVM was run with default parameters and a Gaussian radial basis function kernel.</p>
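<p>The k-mer count representation mentioned above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes overlapping k-mers read with a sliding window of step one and a fixed lexicographic ordering of the vocabulary over the alphabet ACGT, neither of which is stated in the text, and the function name is hypothetical.</p>

```python
from collections import Counter
from itertools import product


def kmer_count_vector(sequence, k, alphabet="ACGT"):
    """Return the vector of k-mer occurrence counts of a sequence.

    The vector has len(alphabet) ** k entries, ordered lexicographically.
    Windows containing symbols outside the alphabet are ignored.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


# Toy example with k = 2 (the experiments in the text use k = 8).
vector = kmer_count_vector("ACGTAC", 2)
print(len(vector))  # 16
print(sum(vector))  # 5
```

<p>Since a sequence of length n contains n - k + 1 overlapping k-mers, the raw counts of a short fragment sum to far less than those of a full-length sequence, so the two vectors live on very different scales.</p>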
</sec>
<sec>
<title>Experimental results</title>
<p>The precision scores obtained using our probabilistic topic modeling approach, the RDP classifier and the SVM classifier on the 16S rRNA dataset described in the Training dataset section are shown in the charts of Figures
<xref ref-type="fig" rid="F4">4</xref>
to
<xref ref-type="fig" rid="F7">7</xref>
. The precision scores are average results obtained by means of a ten-fold cross-validation procedure. Each chart shows the score trends as a function of the fragment size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp) at a different taxonomic rank, from phylum to family. Unfortunately, the RDP classifier works only with sequences of at least 50 bp: with fragments of 40 bp and 25 bp it is unable to provide a classification result. For this reason, the precision scores for 40 bp and 25 bp fragments have been linearly extrapolated for the RDP curve. In all the charts, extrapolated values are represented by the dashed part of the RDP curve. From all the charts, it is immediately clear that the SVM classifier provides acceptable precision results, ranging from 99% at phylum level down to 97% at family level, only when applied to full-length sequences. In all other situations, the performance of the SVM algorithm drops significantly. In fact, with sequence sizes from 400 bp to 25 bp, the SVM completely loses its predictive power, becoming useless when applied to sequence fragments. This behaviour reflects the fact that the vector representation of sequence fragments is quite different from the vector representation of the full sequences composing the training set. The SVM, therefore, is not able to generalize to the prediction of short sequences. Our approach, briefly called the LDA approach from here on, and the RDP classifier show, on the other hand, more robust and significant results. LDA and RDP, in fact, always produce very similar results, at each taxonomic level and for each sequence size, from full length to 50 bp. The LDA precision scores are slightly lower than the results obtained through the RDP classifier, with an average spread within 10%, and maximum scores greater than 70% in each case. Our LDA approach shows its effectiveness when applied to ultra-short fragments, i.e. 50 bp, 40 bp and 25 bp. 
Considering 50 bp fragments, the LDA and the RDP scores are very close, within a 5% difference; but while the RDP classifier works only with fragments of at least 50 bp, our LDA approach still gives reliable results, about 70%, with fragments of 40 and 25 bp. This means, for example, that with only 25 nucleotides we are able to predict the family of an unknown sequence with a precision of about 70%. Moreover, at class, order and family level, our LDA approach not only gives acceptable classification results, but also obtains higher scores than the ones extrapolated for the RDP classifier. This behaviour is most evident in the family case, Figure
<xref ref-type="fig" rid="F7">7</xref>
, where the LDA method surpasses the RDP score with 50 bp fragments, with an increase of 11%; if we consider the estimated score of RDP at 25 bp, the performance increment is about 140%. Furthermore, in this chart we can observe that the performance decrease of the LDA approach is very smooth, while the RDP classifier shows a rapid performance drop on ultra-short sequences (50, 40 and 25 bp).</p>
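<p>The linear extrapolation of the RDP curve and the percentage increments quoted above can be illustrated with the following Python sketch. The text does not state which measured points were used for the extrapolation, so the sketch assumes the straight line through the two shortest measured fragment sizes; it also assumes that the quoted increments are relative. The numeric values are hypothetical placeholders, not results from the paper.</p>

```python
def linear_extrapolate(x0, y0, x1, y1, x):
    """Value at x of the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def relative_increase(new, old):
    """Relative increase of new over old, expressed in percent."""
    return 100.0 * (new - old) / old


# Hypothetical precision values at 150 bp and 50 bp (placeholders only).
size_a, score_a = 150, 0.80
size_b, score_b = 50, 0.60

# Extrapolated score for a fragment size the classifier cannot handle.
estimated_25 = linear_extrapolate(size_a, score_a, size_b, score_b, 25)

# Comparison against a hypothetical score measured at the same size.
hypothetical_other_score = 0.70
gain = relative_increase(hypothetical_other_score, estimated_25)
print(round(estimated_25, 3), round(gain, 1))
```

<p>Because the extrapolated values are not measurements, any increment computed against them, such as the one at 25 bp, should be read as an estimate.</p>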
<fig id="F4" position="float">
<label>Figure 4</label>
<caption>
<p>
<bold>Precision scores at phylum level</bold>
. Trends of the precision scores, defined as
<italic>true positives</italic>
/(
<italic>true positives </italic>
+
<italic>false positives</italic>
), as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at the phylum taxonomic rank. The dashed line for the RDP curve represents extrapolated values.</p>
</caption>
<graphic xlink:href="1471-2105-16-S6-S2-4"></graphic>
</fig>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption>
<p>
<bold>Precision scores at class level</bold>
. Trends of the precision scores, defined as
<italic>true positives</italic>
/(
<italic>true positives </italic>
+
<italic>false positives</italic>
), as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at the class taxonomic rank. The dashed line for the RDP curve represents extrapolated values.</p>
</caption>
<graphic xlink:href="1471-2105-16-S6-S2-5"></graphic>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption>
<p>
<bold>Precision scores at order level</bold>
. Trends of the precision scores, defined as
<italic>true positives</italic>
/(
<italic>true positives </italic>
+
<italic>false positives</italic>
), as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at the order taxonomic rank. The dashed line for the RDP curve represents extrapolated values.</p>
</caption>
<graphic xlink:href="1471-2105-16-S6-S2-6"></graphic>
</fig>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption>
<p>
<bold>Precision scores at family level</bold>
. Trends of the precision scores, defined as
<italic>true positives</italic>
/(
<italic>true positives </italic>
+
<italic>false positives</italic>
), as a function of the sequence size (full length, 400 bp, 200 bp, 150 bp, 50 bp, 40 bp, 25 bp), for the Latent Dirichlet Allocation (LDA), Ribosomal Database Project (RDP) and Support Vector Machine (SVM) classifiers at the family taxonomic rank. The dashed line for the RDP curve represents extrapolated values.</p>
</caption>
<graphic xlink:href="1471-2105-16-S6-S2-7"></graphic>
</fig>
</sec>
</sec>
<sec sec-type="conclusions">
<title>Conclusion</title>
<p>In this paper we presented a novel computational approach for gene sequence classification. Using probabilistic topic models, mainly adopted in text mining applications, we developed a pipeline that, by means of the Latent Dirichlet Allocation algorithm, is able to learn a probabilistic topic model from a dataset of 16S gene sequences. Considering each genomic sequence as a document, our goal is to extract the topics, that is, recurring meaningful themes, from the training sequence dataset. On the basis of their topic distributions, our aim is to demonstrate that sequences sharing the same groups of highly probable topics belong to the same taxonomic ranks. Classification results, in terms of precision scores, were compared with those of the RDP classifier, which represents the state-of-the-art sequence classifier, and of the general-purpose SVM classifier. Experiments were carried out at different taxonomic levels, from phylum to family, and for different sequence sizes, from full length down to 25 bp. The results show that, for full-length sequences, our approach reached results very similar to those of RDP and SVM, within a 10% spread, at every taxonomic level. The most interesting results concern the robustness and generalization ability of our method with respect to short sequences (from 400 bp to 25 bp). Our approach, therefore, proved reliable on full-length sequences, with precision scores very close to the ones obtained with the RDP and SVM classifiers. Most importantly, it demonstrated high robustness, with a smooth decrease in performance when applied to the classification of ultra-short sequences. In the near future, we want to further validate our approach by considering noisy sequences, i.e. sequences that are "not good" according to the RDP repository parameters, and by taking into account sequence fragments extracted from different parts of the original sequences. 
Noisy sequences are interesting because, for example, in the case of environmental species, degraded sequences may be obtained. The study of several fragments of the same input sequence can allow us to understand which part of the original sequence carries the most informative content.</p>
</sec>
<sec>
<title>Competing interests</title>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<title>Authors' contributions</title>
<p>MLR: project conception, implementation, experimental tests, writing, assessment, discussions.</p>
<p>AF: project conception, discussions, assessment, writing.</p>
<p>RR: project conception, discussions, assessment, writing.</p>
<p>AU: project conception, discussions, assessment, writing, funding.</p>
<p>All authors read and approved the final manuscript.</p>
</sec>
</body>
<back>
<sec>
<title>Declarations</title>
<p>The publication costs for this article were funded by the CNR Interomics Flagship Project "Development of an integrated platform for the application of "omic" sciences to biomarker definition and theranostic, predictive and diagnostic profiles".</p>
<p>This article has been published as part of
<italic>BMC Bioinformatics </italic>
Volume 16 Supplement 6, 2015: Selected articles from the 10th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. The full contents of the supplement are available online at
<ext-link ext-link-type="uri" xlink:href="http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S6">http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S6</ext-link>
.</p>
</sec>
<ref-list>
<ref id="B1">
<mixed-citation publication-type="journal">
<name>
<surname>Drancourt</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Raoult</surname>
<given-names>D</given-names>
</name>
<article-title>Sequence-based identification of new bacteria: a proposition for creation of an orphan bacterium repository</article-title>
<source>J Clin Microbiol</source>
<year>2005</year>
<volume>43</volume>
<issue>9</issue>
<fpage>4311</fpage>
<lpage>4315</lpage>
<pub-id pub-id-type="doi">10.1128/JCM.43.9.4311-4315.2005</pub-id>
<pub-id pub-id-type="pmid">16145070</pub-id>
</mixed-citation>
</ref>
<ref id="B2">
<mixed-citation publication-type="journal">
<name>
<surname>Gaston</surname>
<given-names>KJ</given-names>
</name>
<article-title>Global patterns in biodiversity</article-title>
<source>Nature</source>
<year>2000</year>
<volume>405</volume>
<issue>6783</issue>
<fpage>220</fpage>
<lpage>7</lpage>
<pub-id pub-id-type="doi">10.1038/35012228</pub-id>
<pub-id pub-id-type="pmid">10821282</pub-id>
</mixed-citation>
</ref>
<ref id="B3">
<mixed-citation publication-type="journal">
<name>
<surname>Drancourt</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Berger</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Raoult</surname>
<given-names>D</given-names>
</name>
<article-title>Systematic 16S rRNA Gene Sequencing of Atypical Clinical Isolates Identified 27 New Bacterial Species Associated with Humans</article-title>
<source>Journal of Clinical Microbiology</source>
<year>2004</year>
<volume>42</volume>
<issue>5</issue>
<fpage>2197</fpage>
<lpage>2202</lpage>
<pub-id pub-id-type="doi">10.1128/JCM.42.5.2197-2202.2004</pub-id>
<pub-id pub-id-type="pmid">15131188</pub-id>
</mixed-citation>
</ref>
<ref id="B4">
<mixed-citation publication-type="journal">
<name>
<surname>Hebert</surname>
<given-names>PDN</given-names>
</name>
<name>
<surname>Ratnasingham</surname>
<given-names>S</given-names>
</name>
<name>
<surname>DeWaard</surname>
<given-names>JR</given-names>
</name>
<article-title>Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species</article-title>
<source>Proceedings of the Royal Society Series B, Biological sciences</source>
<year>2003</year>
<volume>270</volume>
<issue>Suppl 1</issue>
<fpage>96</fpage>
<lpage>99</lpage>
</mixed-citation>
</ref>
<ref id="B5">
<mixed-citation publication-type="book">
<name>
<surname>Nei</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Kumar</surname>
<given-names>MD</given-names>
</name>
<source>Molecular Evolution and Phylogenetics</source>
<year>2000</year>
<publisher-name>Oxford University Press, New York</publisher-name>
</mixed-citation>
</ref>
<ref id="B6">
<mixed-citation publication-type="book">
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Di Fatta</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gaglio</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Giammanco</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<article-title>Soft topographic map for clustering and classification of bacteria</article-title>
<source>Advances in Intelligent Data Analysis VII Lecture Notes in Computer Science</source>
<year>2007</year>
<volume>4723</volume>
<publisher-name>Springer, Berlin, Heidelberg</publisher-name>
<fpage>332</fpage>
<lpage>343</lpage>
</mixed-citation>
</ref>
<ref id="B7">
<mixed-citation publication-type="journal">
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<article-title>Soft Topographic Maps for Clustering and Classifying Bacteria Using Housekeeping Genes</article-title>
<source>Advances in Artificial Neural Systems</source>
<year>2011</year>
<volume>2011</volume>
</mixed-citation>
</ref>
<ref id="B8">
<mixed-citation publication-type="book">
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Gaglio</surname>
<given-names>S</given-names>
</name>
<article-title>Comparison of genomic sequences clustering using normalized compression distance and evolutionary distance</article-title>
<source>Knowledge-Based Intelligent Information and Engineering Systems Lecture Notes in Computer Science</source>
<year>2008</year>
<volume>5179</volume>
<publisher-name>Springer, Berlin, Heidelberg</publisher-name>
<fpage>740</fpage>
<lpage>746</lpage>
</mixed-citation>
</ref>
<ref id="B9">
<mixed-citation publication-type="journal">
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Gaglio</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<article-title>Normalised compression distance and evolutionary distance of genomic sequences: comparison of clustering results</article-title>
<source>International Journal of Knowledge Engineering and Soft Data Paradigms</source>
<year>2009</year>
<volume>1</volume>
<issue>4</issue>
<fpage>345</fpage>
<lpage>362</lpage>
<pub-id pub-id-type="doi">10.1504/IJKESDP.2009.028987</pub-id>
</mixed-citation>
</ref>
<ref id="B10">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Vitanyi</surname>
<given-names>PMB</given-names>
</name>
<article-title>The similarity metric</article-title>
<source>IEEE Transactions on Information Theory</source>
<year>2004</year>
<volume>50</volume>
<issue>12</issue>
<fpage>3250</fpage>
<lpage>3264</lpage>
<pub-id pub-id-type="doi">10.1109/TIT.2004.838101</pub-id>
</mixed-citation>
</ref>
<ref id="B11">
<mixed-citation publication-type="book">
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fiannaca</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<person-group person-group-type="editor">Peterson, L.E., Masulli, F., Russo, G</person-group>
<article-title>A Study of Compression-Based Methods for the Analysis of Barcode Sequences</article-title>
<source>Computational Intelligence Methods for Bioinformatics and Biostatistics Lecture Notes in Computer Science</source>
<year>2013</year>
<volume>7845</volume>
<publisher-name>Springer, Berlin, Heidelberg</publisher-name>
<fpage>105</fpage>
<lpage>116</lpage>
<pub-id pub-id-type="doi">10.1007/978-3-642-38342-7_10</pub-id>
</mixed-citation>
</ref>
<ref id="B12">
<mixed-citation publication-type="journal">
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fiannaca</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<article-title>Alignment-free analysis of barcode sequences by means of compression-based methods</article-title>
<source>BMC Bioinformatics</source>
<year>2013</year>
<volume>14</volume>
<issue>Suppl 7</issue>
<fpage>S4</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-14-S7-S4</pub-id>
<pub-id pub-id-type="pmid">23815444</pub-id>
</mixed-citation>
</ref>
<ref id="B13">
<mixed-citation publication-type="journal">
<name>
<surname>Chor</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Horn</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Goldman</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Levy</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Massingham</surname>
<given-names>T</given-names>
</name>
<article-title>Genomic DNA k-mer spectra: models and modalities</article-title>
<source>Genome biology</source>
<year>2009</year>
<volume>10</volume>
<issue>10</issue>
<fpage>R108</fpage>
<pub-id pub-id-type="doi">10.1186/gb-2009-10-10-r108</pub-id>
<pub-id pub-id-type="pmid">19814784</pub-id>
</mixed-citation>
</ref>
<ref id="B14">
<mixed-citation publication-type="book">
<name>
<surname>Scholkopf</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Smola</surname>
<given-names>AJ</given-names>
</name>
<source>Learning with Kernels</source>
<year>2002</year>
<publisher-name>MIT Press, Cambridge</publisher-name>
</mixed-citation>
</ref>
<ref id="B15">
<mixed-citation publication-type="book">
<name>
<surname>Kuksa</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pavlovic</surname>
<given-names>V</given-names>
</name>
<person-group person-group-type="editor">Giancarlo, R., Hannenhalli, S</person-group>
<article-title>Fast Kernel Methods for SVM Sequence Classifiers</article-title>
<source>Algorithms in Bioinformatics Lecture Notes in Computer Science</source>
<year>2007</year>
<volume>4645</volume>
<publisher-name>Springer, Berlin, Heidelberg</publisher-name>
<fpage>228</fpage>
<lpage>239</lpage>
<pub-id pub-id-type="doi">10.1007/978-3-540-74126-8_22</pub-id>
</mixed-citation>
</ref>
<ref id="B16">
<mixed-citation publication-type="journal">
<name>
<surname>Kuksa</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pavlovic</surname>
<given-names>V</given-names>
</name>
<article-title>Efficient alignment-free DNA barcode analytics</article-title>
<source>BMC Bioinformatics</source>
<year>2009</year>
<volume>10</volume>
<issue>Suppl 14</issue>
<fpage>S9</fpage>
<pub-id pub-id-type="doi">10.1186/1471-2105-10-S14-S9</pub-id>
<pub-id pub-id-type="pmid">19900305</pub-id>
</mixed-citation>
</ref>
<ref id="B17">
<mixed-citation publication-type="journal">
<name>
<surname>Martinetz</surname>
<given-names>TM</given-names>
</name>
<name>
<surname>Berkovich</surname>
<given-names>SG</given-names>
</name>
<name>
<surname>Schulten</surname>
<given-names>KJ</given-names>
</name>
<article-title>"Neural-gas" network for vector quantization and its application to time-series prediction</article-title>
<source>IEEE transactions on neural networks</source>
<year>1993</year>
<volume>4</volume>
<issue>4</issue>
<fpage>558</fpage>
<lpage>569</lpage>
<pub-id pub-id-type="doi">10.1109/72.238311</pub-id>
<pub-id pub-id-type="pmid">18267757</pub-id>
</mixed-citation>
</ref>
<ref id="B18">
<mixed-citation publication-type="book">
<name>
<surname>Fiannaca</surname>
<given-names>A</given-names>
</name>
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<person-group person-group-type="editor">Iliadis, L., Papadopoulos, H., Jayne, C</person-group>
<article-title>Analysis of DNA Barcode Sequences Using Neural Gas and Spectral Representation</article-title>
<source>Engineering Applications of Neural Networks Communications in Computer and Information Science</source>
<year>2013</year>
<volume>384</volume>
<publisher-name>Springer, Berlin, Heidelberg</publisher-name>
<fpage>212</fpage>
<lpage>221</lpage>
<pub-id pub-id-type="doi">10.1007/978-3-642-41016-1_23</pub-id>
</mixed-citation>
</ref>
<ref id="B19">
<mixed-citation publication-type="journal">
<name>
<surname>Sandberg</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Winberg</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Bränden</surname>
<given-names>C.-i</given-names>
</name>
<name>
<surname>Kaske</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Ernberg</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Cöster</surname>
<given-names>J</given-names>
</name>
<article-title>Capturing Whole-Genome Characteristics in Short Sequences Using a Naïve Bayesian Classifier</article-title>
<source>Genome Research</source>
<year>2001</year>
<volume>11</volume>
<fpage>1404</fpage>
<lpage>1409</lpage>
<pub-id pub-id-type="doi">10.1101/gr.186401</pub-id>
<pub-id pub-id-type="pmid">11483581</pub-id>
</mixed-citation>
</ref>
<ref id="B20">
<mixed-citation publication-type="journal">
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Garrity</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Cole</surname>
<given-names>JR</given-names>
</name>
<article-title>Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy</article-title>
<source>Applied and environmental microbiology</source>
<year>2007</year>
<volume>73</volume>
<issue>16</issue>
<fpage>5261</fpage>
<lpage>5267</lpage>
<pub-id pub-id-type="doi">10.1128/AEM.00062-07</pub-id>
<pub-id pub-id-type="pmid">17586664</pub-id>
</mixed-citation>
</ref>
<ref id="B21">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>DeSantis</surname>
<given-names>TZ</given-names>
</name>
<name>
<surname>Andersen</surname>
<given-names>GL</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R</given-names>
</name>
<article-title>Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers</article-title>
<source>Nucleic acids research</source>
<year>2008</year>
<volume>36</volume>
<issue>18</issue>
<fpage>e120</fpage>
<pub-id pub-id-type="doi">10.1093/nar/gkn491</pub-id>
<pub-id pub-id-type="pmid">18723574</pub-id>
</mixed-citation>
</ref>
<ref id="B22">
<mixed-citation publication-type="journal">
<name>
<surname>DeSantis</surname>
<given-names>TZ</given-names>
</name>
<name>
<surname>Keller</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Karaoz</surname>
<given-names>U</given-names>
</name>
<name>
<surname>Alekseyenko</surname>
<given-names>AV</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>NNS</given-names>
</name>
<name>
<surname>Brodie</surname>
<given-names>EL</given-names>
</name>
<name>
<surname>Pei</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Andersen</surname>
<given-names>GL</given-names>
</name>
<name>
<surname>Larsen</surname>
<given-names>N</given-names>
</name>
<article-title>Simrank: Rapid and sensitive general-purpose k-mer search tool</article-title>
<source>BMC Ecology</source>
<year>2011</year>
<volume>11</volume>
<fpage>11</fpage>
<pub-id pub-id-type="doi">10.1186/1472-6785-11-11</pub-id>
<pub-id pub-id-type="pmid">21524302</pub-id>
</mixed-citation>
</ref>
<ref id="B23">
<mixed-citation publication-type="journal">
<name>
<surname>Domingos</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Pazzani</surname>
<given-names>M</given-names>
</name>
<article-title>On the optimality of the simple Bayesian classifier under zero-one loss</article-title>
<source>Machine Learning</source>
<year>1997</year>
<volume>29</volume>
<issue>2-3</issue>
<fpage>103</fpage>
<lpage>130</lpage>
</mixed-citation>
</ref>
<ref id="B24">
<mixed-citation publication-type="book">
<name>
<surname>Steyvers</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Griffiths</surname>
<given-names>T</given-names>
</name>
<person-group person-group-type="editor">Landauer, T., McNamara, D.S., Dennis, S., Kintsch, W</person-group>
<article-title>Probabilistic Topic Models</article-title>
<source>Handbook of Latent Semantic Analysis</source>
<year>2007</year>
<publisher-name>Erlbaum, Hillsdale, NJ</publisher-name>
</mixed-citation>
</ref>
<ref id="B25">
<mixed-citation publication-type="book">
<name>
<surname>Perona</surname>
<given-names>P</given-names>
</name>
<article-title>A Bayesian Hierarchical Model for Learning Natural Scene Categories</article-title>
<source>IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)</source>
<year>2005</year>
<volume>2</volume>
<publisher-name>IEEE</publisher-name>
<fpage>524</fpage>
<lpage>531</lpage>
</mixed-citation>
</ref>
<ref id="B26">
<mixed-citation publication-type="journal">
<name>
<surname>Bart</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Welling</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Perona</surname>
<given-names>P</given-names>
</name>
<article-title>Unsupervised organization of image collections: taxonomies and beyond</article-title>
<source>IEEE transactions on pattern analysis and machine intelligence</source>
<year>2011</year>
<volume>33</volume>
<issue>11</issue>
<fpage>2302</fpage>
<lpage>2315</lpage>
<pub-id pub-id-type="pmid">21519098</pub-id>
</mixed-citation>
</ref>
<ref id="B27">
<mixed-citation publication-type="book">
<name>
<surname>Blei</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>MI</given-names>
</name>
<article-title>Modeling annotated data</article-title>
<source>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '03</source>
<year>2003</year>
<volume>127</volume>
<publisher-name>ACM Press, New York, New York, USA</publisher-name>
</mixed-citation>
</ref>
<ref id="B28">
<mixed-citation publication-type="other">
<name>
<surname>Hu</surname>
<given-names>DJ</given-names>
</name>
<name>
<surname>Saul</surname>
<given-names>LK</given-names>
</name>
<article-title>A probabilistic topic model for unsupervised learning of musical key-profiles</article-title>
<source>10th International Society for Music Information Retrieval Conference (ISMIR 2009)</source>
<year>2009</year>
<fpage>441</fpage>
<lpage>446</lpage>
</mixed-citation>
</ref>
<ref id="B29">
<mixed-citation publication-type="other">
<name>
<surname>Kim</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Narayanan</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Sundaram</surname>
<given-names>S</given-names>
</name>
<article-title>Acoustic topic model for audio information retrieval</article-title>
<source>2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics</source>
<year>2009</year>
<fpage>37</fpage>
<lpage>40</lpage>
</mixed-citation>
</ref>
<ref id="B30">
<mixed-citation publication-type="journal">
<name>
<surname>Falush</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Stephens</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Pritchard</surname>
<given-names>JK</given-names>
</name>
<article-title>Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies</article-title>
<source>Genetics</source>
<year>2003</year>
<volume>164</volume>
<issue>4</issue>
<fpage>1567</fpage>
<lpage>1587</lpage>
<pub-id pub-id-type="pmid">12930761</pub-id>
</mixed-citation>
</ref>
<ref id="B31">
<mixed-citation publication-type="book">
<name>
<surname>Pinoli</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Chicco</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Masseroli</surname>
<given-names>M</given-names>
</name>
<article-title>Enhanced probabilistic latent semantic analysis with weighting schemes to predict genomic annotations</article-title>
<source>13th IEEE International Conference on BioInformatics and BioEngineering</source>
<year>2013</year>
<publisher-name>IEEE, Los Alamitos, CA, USA</publisher-name>
<fpage>1</fpage>
<lpage>4</lpage>
</mixed-citation>
</ref>
<ref id="B32">
<mixed-citation publication-type="book">
<name>
<surname>Masseroli</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Chicco</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Pinoli</surname>
<given-names>P</given-names>
</name>
<article-title>Probabilistic Latent Semantic Analysis for prediction of Gene Ontology annotations</article-title>
<source>The 2012 International Joint Conference on Neural Networks (IJCNN)</source>
<year>2012</year>
<publisher-name>IEEE, Brisbane, QLD</publisher-name>
<fpage>1</fpage>
<lpage>8</lpage>
</mixed-citation>
</ref>
<ref id="B33">
<mixed-citation publication-type="book">
<name>
<surname>Hofmann</surname>
<given-names>T</given-names>
</name>
<article-title>Probabilistic latent semantic indexing</article-title>
<source>Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '99</source>
<year>1999</year>
<publisher-name>ACM Press, New York, New York, USA</publisher-name>
<fpage>50</fpage>
<lpage>57</lpage>
</mixed-citation>
</ref>
<ref id="B34">
<mixed-citation publication-type="journal">
<name>
<surname>Blei</surname>
<given-names>DM</given-names>
</name>
<article-title>Probabilistic Topic Models</article-title>
<source>Communications of the ACM</source>
<year>2012</year>
<volume>55</volume>
<issue>4</issue>
<fpage>77</fpage>
<lpage>84</lpage>
<pub-id pub-id-type="doi">10.1145/2133806.2133826</pub-id>
</mixed-citation>
</ref>
<ref id="B35">
<mixed-citation publication-type="journal">
<name>
<surname>Griffiths</surname>
<given-names>TL</given-names>
</name>
<name>
<surname>Steyvers</surname>
<given-names>M</given-names>
</name>
<article-title>Finding scientific topics</article-title>
<source>PNAS</source>
<year>2004</year>
<volume>101</volume>
<issue>Suppl 1</issue>
<fpage>5228</fpage>
<lpage>5235</lpage>
<pub-id pub-id-type="pmid">14872004</pub-id>
</mixed-citation>
</ref>
<ref id="B36">
<mixed-citation publication-type="journal">
<name>
<surname>Blei</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>AY</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>MI</given-names>
</name>
<article-title>Latent Dirichlet Allocation</article-title>
<source>J Mach Learn Res</source>
<year>2003</year>
<volume>3</volume>
<fpage>993</fpage>
<lpage>1022</lpage>
</mixed-citation>
</ref>
<ref id="B37">
<mixed-citation publication-type="book">
<name>
<surname>Li</surname>
<given-names>W</given-names>
</name>
<name>
<surname>McCallum</surname>
<given-names>A</given-names>
</name>
<article-title>Pachinko allocation: DAG-structured mixture models of topic correlations</article-title>
<source>Proceedings of the 23rd International Conference on Machine Learning - ICML '06</source>
<year>2006</year>
<publisher-name>ACM Press, New York, New York, USA</publisher-name>
<fpage>577</fpage>
<lpage>584</lpage>
</mixed-citation>
</ref>
<ref id="B38">
<mixed-citation publication-type="journal">
<name>
<surname>Teh</surname>
<given-names>YW</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>MI</given-names>
</name>
<name>
<surname>Beal</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Blei</surname>
<given-names>DM</given-names>
</name>
<article-title>Hierarchical Dirichlet Processes</article-title>
<source>Journal of the American Statistical Association</source>
<year>2006</year>
<volume>101</volume>
<issue>476</issue>
<fpage>1566</fpage>
<lpage>1581</lpage>
<pub-id pub-id-type="doi">10.1198/016214506000000302</pub-id>
</mixed-citation>
</ref>
<ref id="B39">
<mixed-citation publication-type="journal">
<name>
<surname>Grün</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Hornik</surname>
<given-names>K</given-names>
</name>
<article-title>topicmodels: An R Package for Fitting Topic Models</article-title>
<source>Journal of Statistical Software</source>
<year>2011</year>
<volume>40</volume>
<issue>13</issue>
</mixed-citation>
</ref>
<ref id="B40">
<mixed-citation publication-type="journal">
<name>
<surname>Casella</surname>
<given-names>G</given-names>
</name>
<name>
<surname>George</surname>
<given-names>EI</given-names>
</name>
<article-title>Explaining the Gibbs Sampler</article-title>
<source>The American Statistician</source>
<year>1992</year>
<volume>46</volume>
<issue>3</issue>
<fpage>167</fpage>
<lpage>174</lpage>
</mixed-citation>
</ref>
<ref id="B41">
<mixed-citation publication-type="journal">
<name>
<surname>Cole</surname>
<given-names>JR</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Cardenas</surname>
<given-names>E</given-names>
</name>
<name>
<surname>Fish</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chai</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Farris</surname>
<given-names>RJ</given-names>
</name>
<name>
<surname>Kulam-Syed-Mohideen</surname>
<given-names>AS</given-names>
</name>
<name>
<surname>McGarrell</surname>
<given-names>DM</given-names>
</name>
<name>
<surname>Marsh</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Garrity</surname>
<given-names>GM</given-names>
</name>
<name>
<surname>Tiedje</surname>
<given-names>JM</given-names>
</name>
<article-title>The Ribosomal Database Project: improved alignments and new tools for rRNA analysis</article-title>
<source>Nucleic acids research</source>
<year>2009</year>
<volume>37</volume>
<issue>Database issue</issue>
<fpage>D141</fpage>
<lpage>D145</lpage>
<pub-id pub-id-type="pmid">19004872</pub-id>
</mixed-citation>
</ref>
<ref id="B42">
<mixed-citation publication-type="journal">
<name>
<surname>Geer</surname>
<given-names>LY</given-names>
</name>
<name>
<surname>Marchler-Bauer</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Geer</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>L</given-names>
</name>
<name>
<surname>He</surname>
<given-names>J</given-names>
</name>
<name>
<surname>He</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>W</given-names>
</name>
<name>
<surname>Bryant</surname>
<given-names>SH</given-names>
</name>
<article-title>The NCBI BioSystems database</article-title>
<source>Nucleic acids research</source>
<year>2010</year>
<volume>38</volume>
<issue>Database issue</issue>
<fpage>D492</fpage>
<lpage>D496</lpage>
<pub-id pub-id-type="pmid">19854944</pub-id>
</mixed-citation>
</ref>
<ref id="B43">
<mixed-citation publication-type="book">
<name>
<surname>La Rosa</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Fiannaca</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Rizzo</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Urso</surname>
<given-names>A</given-names>
</name>
<article-title>Genomic Sequence Classification using Probabilistic Topic Modeling</article-title>
<source>Computational Intelligence Methods for Bioinformatics and Biostatistics. Lecture Notes in Computer Science</source>
<year>2014</year>
<volume>8452</volume>
<publisher-name>Springer, Berlin, Heidelberg</publisher-name>
<fpage>49</fpage>
<lpage>61</lpage>
<pub-id pub-id-type="doi">10.1007/978-3-319-09042-9_4</pub-id>
</mixed-citation>
</ref>
<ref id="B44">
<mixed-citation publication-type="journal">
<name>
<surname>Wooley</surname>
<given-names>JC</given-names>
</name>
<name>
<surname>Godzik</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Friedberg</surname>
<given-names>I</given-names>
</name>
<article-title>A primer on metagenomics</article-title>
<source>PLoS Comput Biol</source>
<year>2010</year>
<volume>6</volume>
<issue>2</issue>
<fpage>e1000667</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1000667</pub-id>
</mixed-citation>
</ref>
<ref id="B45">
<mixed-citation publication-type="journal">
<name>
<surname>Karatzoglou</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Meyer</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Hornik</surname>
<given-names>K</given-names>
</name>
<article-title>Support Vector Machines in R</article-title>
<source>Journal of Statistical Software</source>
<year>2006</year>
<volume>15</volume>
<issue>9</issue>
<fpage>1</fpage>
<lpage>28</lpage>
</mixed-citation>
</ref>
<ref id="B46">
<mixed-citation publication-type="journal">
<name>
<surname>Chang</surname>
<given-names>CC</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>CJ</given-names>
</name>
<article-title>LIBSVM: A library for support vector machines</article-title>
<source>ACM Transactions on Intelligent Systems and Technology</source>
<year>2011</year>
<volume>2</volume>
<issue>3</issue>
<fpage>1</fpage>
<lpage>27</lpage>
</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000951  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000951  | SxmlIndent | more

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021