MERS exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses

Internal identifier: 000F88 (Pmc/Curation); previous: 000F87; next: 000F89

Authors: Stephen Woloszynek [United States]; Zhengqiao Zhao [United States]; Jian Chen [United States]; Gail L. Rosen [United States]

Source: PLoS Computational Biology (2019)

RBID : PMC:6407789

Abstract

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
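
The abstract describes splitting each 16S rRNA read into k-mers, training Skip-Gram word2vec on them, and then combining k-mer vectors into a sequence or sample embedding. The following is a minimal sketch of those steps, not the authors' implementation. The function names, the values of k and the window size, the example read, and the toy vectors are all hypothetical. A plain average is used as a stand-in for the sentence embedding technique, which the abstract does not name.

```python
def kmers(sequence, k=6):
    """Return the overlapping k-mers of a nucleotide sequence, left to right."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs, the training examples for a Skip-Gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def mean_embedding(tokens, vectors):
    """Average the vectors of the tokens that have one.

    Assumption: a plain mean stands in for the sentence embedding step.
    Returns None when no token has a vector.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]


# Hypothetical read and toy 2-dimensional k-mer vectors, for illustration only.
read = "ACGTACGGT"
tokens = kmers(read, k=4)
pairs = skipgram_pairs(tokens, window=1)
toy_vectors = {"ACGT": [1.0, 0.0], "CGTA": [0.0, 1.0]}
read_embedding = mean_embedding(tokens, toy_vectors)
```

In practice the (center, context) pairs would be fed to a word2vec trainer to learn the k-mer vectors; the toy dictionary here only shows how the averaging step consumes them.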


URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6407789
DOI: 10.1371/journal.pcbi.1006721
PubMed: 30807567
PubMed Central: 6407789

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses</title>
<author>
<name sortKey="Woloszynek, Stephen" sort="Woloszynek, Stephen" uniqKey="Woloszynek S" first="Stephen" last="Woloszynek">Stephen Woloszynek</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhao, Zhengqiao" sort="Zhao, Zhengqiao" uniqKey="Zhao Z" first="Zhengqiao" last="Zhao">Zhengqiao Zhao</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Chen, Jian" sort="Chen, Jian" uniqKey="Chen J" first="Jian" last="Chen">Jian Chen</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Rosen, Gail L" sort="Rosen, Gail L" uniqKey="Rosen G" first="Gail L." last="Rosen">Gail L. Rosen</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">30807567</idno>
<idno type="pmc">6407789</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6407789</idno>
<idno type="RBID">PMC:6407789</idno>
<idno type="doi">10.1371/journal.pcbi.1006721</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000F88</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000F88</idno>
<idno type="wicri:Area/Pmc/Curation">000F88</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000F88</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses</title>
<author>
<name sortKey="Woloszynek, Stephen" sort="Woloszynek, Stephen" uniqKey="Woloszynek S" first="Stephen" last="Woloszynek">Stephen Woloszynek</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhao, Zhengqiao" sort="Zhao, Zhengqiao" uniqKey="Zhao Z" first="Zhengqiao" last="Zhao">Zhengqiao Zhao</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Chen, Jian" sort="Chen, Jian" uniqKey="Chen J" first="Jian" last="Chen">Jian Chen</name>
<affiliation wicri:level="1">
<nlm:aff id="aff002">
<addr-line>Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Rosen, Gail L" sort="Rosen, Gail L" uniqKey="Rosen G" first="Gail L." last="Rosen">Gail L. Rosen</name>
<affiliation wicri:level="1">
<nlm:aff id="aff001">
<addr-line>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS Computational Biology</title>
<idno type="ISSN">1553-734X</idno>
<idno type="eISSN">1553-7358</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure
<italic>in situ</italic>
. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed
<italic>k</italic>
-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as
<italic>k</italic>
-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the
<italic>k</italic>
-mer embedding space captured distinct
<italic>k</italic>
-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Gevers, D" uniqKey="Gevers D">D Gevers</name>
</author>
<author>
<name sortKey="Kugathasan, S" uniqKey="Kugathasan S">S Kugathasan</name>
</author>
<author>
<name sortKey="Denson, L" uniqKey="Denson L">L Denson</name>
</author>
<author>
<name sortKey="Vazquez Baeza, Y" uniqKey="Vazquez Baeza Y">Y Vázquez-Baeza</name>
</author>
<author>
<name sortKey="Van Treuren, W" uniqKey="Van Treuren W">W Van Treuren</name>
</author>
<author>
<name sortKey="Ren, B" uniqKey="Ren B">B Ren</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Schmidt, Bl" uniqKey="Schmidt B">BL Schmidt</name>
</author>
<author>
<name sortKey="Kuczynski, J" uniqKey="Kuczynski J">J Kuczynski</name>
</author>
<author>
<name sortKey="Bhattacharya, A" uniqKey="Bhattacharya A">A Bhattacharya</name>
</author>
<author>
<name sortKey="Huey, B" uniqKey="Huey B">B Huey</name>
</author>
<author>
<name sortKey="Corby, Pm" uniqKey="Corby P">PM Corby</name>
</author>
<author>
<name sortKey="Queiroz, El" uniqKey="Queiroz E">EL Queiroz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Woloszynek, S" uniqKey="Woloszynek S">S Woloszynek</name>
</author>
<author>
<name sortKey="Pastor, S" uniqKey="Pastor S">S Pastor</name>
</author>
<author>
<name sortKey="Mell, Jc" uniqKey="Mell J">JC Mell</name>
</author>
<author>
<name sortKey="Nandi, N" uniqKey="Nandi N">N Nandi</name>
</author>
<author>
<name sortKey="Sokhansanj, B" uniqKey="Sokhansanj B">B Sokhansanj</name>
</author>
<author>
<name sortKey="Rosen, Gl" uniqKey="Rosen G">GL Rosen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Henry, S" uniqKey="Henry S">S Henry</name>
</author>
<author>
<name sortKey="Baudoin, E" uniqKey="Baudoin E">E Baudoin</name>
</author>
<author>
<name sortKey="L Pez Gutierrez, Jc" uniqKey="L Pez Gutierrez J">JC López-Gutiérrez</name>
</author>
<author>
<name sortKey="Martin Laurent, F" uniqKey="Martin Laurent F">F Martin-Laurent</name>
</author>
<author>
<name sortKey="Brauman, A" uniqKey="Brauman A">A Brauman</name>
</author>
<author>
<name sortKey="Philippot, L" uniqKey="Philippot L">L Philippot</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Okano, Y" uniqKey="Okano Y">Y Okano</name>
</author>
<author>
<name sortKey="Hristova, Kr" uniqKey="Hristova K">KR Hristova</name>
</author>
<author>
<name sortKey="Leutenegger, Cm" uniqKey="Leutenegger C">CM Leutenegger</name>
</author>
<author>
<name sortKey="Jackson, Le" uniqKey="Jackson L">LE Jackson</name>
</author>
<author>
<name sortKey="Denison, Rf" uniqKey="Denison R">RF Denison</name>
</author>
<author>
<name sortKey="Gebreyesus, B" uniqKey="Gebreyesus B">B Gebreyesus</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sunagawa, S" uniqKey="Sunagawa S">S Sunagawa</name>
</author>
<author>
<name sortKey="Coelho, Lp" uniqKey="Coelho L">LP Coelho</name>
</author>
<author>
<name sortKey="Chaffron, S" uniqKey="Chaffron S">S Chaffron</name>
</author>
<author>
<name sortKey="Kultima, Jr" uniqKey="Kultima J">JR Kultima</name>
</author>
<author>
<name sortKey="Labadie, K" uniqKey="Labadie K">K Labadie</name>
</author>
<author>
<name sortKey="Salazar, G" uniqKey="Salazar G">G Salazar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="De Vos, Wm" uniqKey="De Vos W">WM de Vos</name>
</author>
<author>
<name sortKey="De Vos, Eaj" uniqKey="De Vos E">EAJ De Vos</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ni, J" uniqKey="Ni J">J Ni</name>
</author>
<author>
<name sortKey="Wu, Gd" uniqKey="Wu G">GD Wu</name>
</author>
<author>
<name sortKey="Albenberg, L" uniqKey="Albenberg L">L Albenberg</name>
</author>
<author>
<name sortKey="Tomov, Vt" uniqKey="Tomov V">VT Tomov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Saraswati, S" uniqKey="Saraswati S">S Saraswati</name>
</author>
<author>
<name sortKey="Sitaraman, R" uniqKey="Sitaraman R">R Sitaraman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Harley, Itw" uniqKey="Harley I">ITW Harley</name>
</author>
<author>
<name sortKey="Karp, Cl" uniqKey="Karp C">CL Karp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Caporaso, J" uniqKey="Caporaso J">J Caporaso</name>
</author>
<author>
<name sortKey="Kuczynski, J" uniqKey="Kuczynski J">J Kuczynski</name>
</author>
<author>
<name sortKey="Stombaugh, J" uniqKey="Stombaugh J">J Stombaugh</name>
</author>
<author>
<name sortKey="Bittinger, K" uniqKey="Bittinger K">K Bittinger</name>
</author>
<author>
<name sortKey="Bushman" uniqKey="Bushman">Bushman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Callahan, Bj" uniqKey="Callahan B">BJ Callahan</name>
</author>
<author>
<name sortKey="Mcmurdie, Pj" uniqKey="Mcmurdie P">PJ McMurdie</name>
</author>
<author>
<name sortKey="Rosen, Mj" uniqKey="Rosen M">MJ Rosen</name>
</author>
<author>
<name sortKey="Han, Aw" uniqKey="Han A">AW Han</name>
</author>
<author>
<name sortKey="Johnson, Aja" uniqKey="Johnson A">AJA Johnson</name>
</author>
<author>
<name sortKey="Holmes, Sp" uniqKey="Holmes S">SP Holmes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nguyen, Np" uniqKey="Nguyen N">NP Nguyen</name>
</author>
<author>
<name sortKey="Warnow, T" uniqKey="Warnow T">T Warnow</name>
</author>
<author>
<name sortKey="Pop, M" uniqKey="Pop M">M Pop</name>
</author>
<author>
<name sortKey="White, B" uniqKey="White B">B White</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Callahan, Bj" uniqKey="Callahan B">BJ Callahan</name>
</author>
<author>
<name sortKey="Mcmurdie, Pj" uniqKey="Mcmurdie P">PJ McMurdie</name>
</author>
<author>
<name sortKey="Holmes, Sp" uniqKey="Holmes S">SP Holmes</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mysara, M" uniqKey="Mysara M">M Mysara</name>
</author>
<author>
<name sortKey="Vandamme, P" uniqKey="Vandamme P">P Vandamme</name>
</author>
<author>
<name sortKey="Props, R" uniqKey="Props R">R Props</name>
</author>
<author>
<name sortKey="Kerckhof, Fm" uniqKey="Kerckhof F">FM Kerckhof</name>
</author>
<author>
<name sortKey="Leys, N" uniqKey="Leys N">N Leys</name>
</author>
<author>
<name sortKey="Boon, N" uniqKey="Boon N">N Boon</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Edgar, Rc" uniqKey="Edgar R">RC Edgar</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lan, Y" uniqKey="Lan Y">Y Lan</name>
</author>
<author>
<name sortKey="Morrison, Jc" uniqKey="Morrison J">JC Morrison</name>
</author>
<author>
<name sortKey="Hershberg, R" uniqKey="Hershberg R">R Hershberg</name>
</author>
<author>
<name sortKey="Rosen, Gl" uniqKey="Rosen G">GL Rosen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nelson, Mc" uniqKey="Nelson M">MC Nelson</name>
</author>
<author>
<name sortKey="Morrison, Hg" uniqKey="Morrison H">HG Morrison</name>
</author>
<author>
<name sortKey="Benjamino, J" uniqKey="Benjamino J">J Benjamino</name>
</author>
<author>
<name sortKey="Grim, Sl" uniqKey="Grim S">SL Grim</name>
</author>
<author>
<name sortKey="Graf, J" uniqKey="Graf J">J Graf</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Golob, Jl" uniqKey="Golob J">JL Golob</name>
</author>
<author>
<name sortKey="Margolis, E" uniqKey="Margolis E">E Margolis</name>
</author>
<author>
<name sortKey="Hoffman, Ng" uniqKey="Hoffman N">NG Hoffman</name>
</author>
<author>
<name sortKey="Fredricks, Dn" uniqKey="Fredricks D">DN Fredricks</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ng, P" uniqKey="Ng P">P Ng</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bengio, Y" uniqKey="Bengio Y">Y Bengio</name>
</author>
<author>
<name sortKey="Senecal, Js" uniqKey="Senecal J">JS Senécal</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wood, De" uniqKey="Wood D">DE Wood</name>
</author>
<author>
<name sortKey="Salzberg, Sl" uniqKey="Salzberg S">SL Salzberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Desantis, Tz" uniqKey="Desantis T">TZ DeSantis</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Larsen, N" uniqKey="Larsen N">N Larsen</name>
</author>
<author>
<name sortKey="Rojas, M" uniqKey="Rojas M">M Rojas</name>
</author>
<author>
<name sortKey="Brodie, El" uniqKey="Brodie E">EL Brodie</name>
</author>
<author>
<name sortKey="Keller, K" uniqKey="Keller K">K Keller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mikolov, T" uniqKey="Mikolov T">T Mikolov</name>
</author>
<author>
<name sortKey="Chen, K" uniqKey="Chen K">K Chen</name>
</author>
<author>
<name sortKey="Corrado, G" uniqKey="Corrado G">G Corrado</name>
</author>
<author>
<name sortKey="Dean, J" uniqKey="Dean J">J Dean</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Johnson, R" uniqKey="Johnson R">R Johnson</name>
</author>
<author>
<name sortKey="Zhang, T" uniqKey="Zhang T">T Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Min, X" uniqKey="Min X">X Min</name>
</author>
<author>
<name sortKey="Zeng, W" uniqKey="Zeng W">W Zeng</name>
</author>
<author>
<name sortKey="Chen, N" uniqKey="Chen N">N Chen</name>
</author>
<author>
<name sortKey="Chen, T" uniqKey="Chen T">T Chen</name>
</author>
<author>
<name sortKey="Jiang, R" uniqKey="Jiang R">R Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bengio, Y" uniqKey="Bengio Y">Y Bengio</name>
</author>
<author>
<name sortKey="Lecun, Y" uniqKey="Lecun Y">Y LeCun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Le, Q" uniqKey="Le Q">Q Le</name>
</author>
<author>
<name sortKey="Mikolov, T" uniqKey="Mikolov T">T Mikolov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bahdanau, D" uniqKey="Bahdanau D">D Bahdanau</name>
</author>
<author>
<name sortKey="Bosc, T" uniqKey="Bosc T">T Bosc</name>
</author>
<author>
<name sortKey="Jastrz Bski, S" uniqKey="Jastrz Bski S">S Jastrzȩbski</name>
</author>
<author>
<name sortKey="Grefenstette, E" uniqKey="Grefenstette E">E Grefenstette</name>
</author>
<author>
<name sortKey="Vincent, P" uniqKey="Vincent P">P Vincent</name>
</author>
<author>
<name sortKey="Bengio, Y" uniqKey="Bengio Y">Y Bengio</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X Chen</name>
</author>
<author>
<name sortKey="Hovy, E" uniqKey="Hovy E">E Hovy</name>
</author>
<author>
<name sortKey="Jurafsky, D" uniqKey="Jurafsky D">D Jurafsky</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Asgari, E" uniqKey="Asgari E">E Asgari</name>
</author>
<author>
<name sortKey="Mofrad, Mrk" uniqKey="Mofrad M">MRK Mofrad</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arora, S" uniqKey="Arora S">S Arora</name>
</author>
<author>
<name sortKey="Liang, Y" uniqKey="Liang Y">Y Liang</name>
</author>
<author>
<name sortKey="Ma, T" uniqKey="Ma T">T Ma</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Van Der Maaten, Ljp" uniqKey="Van Der Maaten L">LJP Van Der Maaten</name>
</author>
<author>
<name sortKey="Hinton, Ge" uniqKey="Hinton G">GE Hinton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rognes, T" uniqKey="Rognes T">T Rognes</name>
</author>
<author>
<name sortKey="Flouri, T" uniqKey="Flouri T">T Flouri</name>
</author>
<author>
<name sortKey="Nichols, B" uniqKey="Nichols B">B Nichols</name>
</author>
<author>
<name sortKey="Quince, C" uniqKey="Quince C">C Quince</name>
</author>
<author>
<name sortKey="Mahe, F" uniqKey="Mahe F">F Mahé</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lloyd, Sp" uniqKey="Lloyd S">SP Lloyd</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Friedman, J" uniqKey="Friedman J">J Friedman</name>
</author>
<author>
<name sortKey="Hastie, T" uniqKey="Hastie T">T Hastie</name>
</author>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Tibshirani, R" uniqKey="Tibshirani R">R Tibshirani</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gloor, Gb" uniqKey="Gloor G">GB Gloor</name>
</author>
<author>
<name sortKey="Reid, G" uniqKey="Reid G">G Reid</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Krakovna, V" uniqKey="Krakovna V">V Krakovna</name>
</author>
<author>
<name sortKey="Doshi Velez, F" uniqKey="Doshi Velez F">F Doshi-Velez</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alain, G" uniqKey="Alain G">G Alain</name>
</author>
<author>
<name sortKey="Bengio, Y" uniqKey="Bengio Y">Y Bengio</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Samek, W" uniqKey="Samek W">W Samek</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, B" uniqKey="Kim B">B Kim</name>
</author>
<author>
<name sortKey="Shah, J" uniqKey="Shah J">J Shah</name>
</author>
<author>
<name sortKey="Doshi Velez, F" uniqKey="Doshi Velez F">F Doshi-Velez</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Oh, J" uniqKey="Oh J">J Oh</name>
</author>
<author>
<name sortKey="Conlan, S" uniqKey="Conlan S">S Conlan</name>
</author>
<author>
<name sortKey="Polley, Ec" uniqKey="Polley E">EC Polley</name>
</author>
<author>
<name sortKey="Segre, Ja" uniqKey="Segre J">JA Segre</name>
</author>
<author>
<name sortKey="Kong, Hh" uniqKey="Kong H">HH Kong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Byrd, Al" uniqKey="Byrd A">AL Byrd</name>
</author>
<author>
<name sortKey="Belkaid, Y" uniqKey="Belkaid Y">Y Belkaid</name>
</author>
<author>
<name sortKey="Segre, Ja" uniqKey="Segre J">JA Segre</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ling, Z" uniqKey="Ling Z">Z Ling</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author>
<name sortKey="Cheng, Y" uniqKey="Cheng Y">Y Cheng</name>
</author>
<author>
<name sortKey="Jiang, X" uniqKey="Jiang X">X Jiang</name>
</author>
<author>
<name sortKey="Jiang, H" uniqKey="Jiang H">H Jiang</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, H" uniqKey="Chen H">H Chen</name>
</author>
<author>
<name sortKey="Jiang, W" uniqKey="Jiang W">W Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goodrich, Jk" uniqKey="Goodrich J">JK Goodrich</name>
</author>
<author>
<name sortKey="Waters, Jl" uniqKey="Waters J">JL Waters</name>
</author>
<author>
<name sortKey="Poole, Ac" uniqKey="Poole A">AC Poole</name>
</author>
<author>
<name sortKey="Sutter, Jl" uniqKey="Sutter J">JL Sutter</name>
</author>
<author>
<name sortKey="Koren, O" uniqKey="Koren O">O Koren</name>
</author>
<author>
<name sortKey="Blekhman, R" uniqKey="Blekhman R">R Blekhman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dulal, S" uniqKey="Dulal S">S Dulal</name>
</author>
<author>
<name sortKey="Keku, To" uniqKey="Keku T">TO Keku</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Graessler, J" uniqKey="Graessler J">J Graessler</name>
</author>
<author>
<name sortKey="Qin, Y" uniqKey="Qin Y">Y Qin</name>
</author>
<author>
<name sortKey="Zhong, H" uniqKey="Zhong H">H Zhong</name>
</author>
<author>
<name sortKey="Zhang, J" uniqKey="Zhang J">J Zhang</name>
</author>
<author>
<name sortKey="Licinio, J" uniqKey="Licinio J">J Licinio</name>
</author>
<author>
<name sortKey="Wong, Ml" uniqKey="Wong M">ML Wong</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Crielaard, W" uniqKey="Crielaard W">W Crielaard</name>
</author>
<author>
<name sortKey="Zaura, E" uniqKey="Zaura E">E Zaura</name>
</author>
<author>
<name sortKey="Schuller, Aa" uniqKey="Schuller A">AA Schuller</name>
</author>
<author>
<name sortKey="Huse, Sm" uniqKey="Huse S">SM Huse</name>
</author>
<author>
<name sortKey="Montijn, Rc" uniqKey="Montijn R">RC Montijn</name>
</author>
<author>
<name sortKey="Keijser, Bjf" uniqKey="Keijser B">BJF Keijser</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sampaio Maia, B" uniqKey="Sampaio Maia B">B Sampaio-Maia</name>
</author>
<author>
<name sortKey="Monteiro Silva, F" uniqKey="Monteiro Silva F">F Monteiro-Silva</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ching, T" uniqKey="Ching T">T Ching</name>
</author>
<author>
<name sortKey="Himmelstein, Ds" uniqKey="Himmelstein D">DS Himmelstein</name>
</author>
<author>
<name sortKey="Beaulieu Jones, Bk" uniqKey="Beaulieu Jones B">BK Beaulieu-Jones</name>
</author>
<author>
<name sortKey="Kalinin, Aa" uniqKey="Kalinin A">AA Kalinin</name>
</author>
<author>
<name sortKey="Do, Bt" uniqKey="Do B">BT Do</name>
</author>
<author>
<name sortKey="Way, Gp" uniqKey="Way G">GP Way</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Desantis, Tz" uniqKey="Desantis T">TZ DeSantis</name>
</author>
<author>
<name sortKey="Hugenholtz, P" uniqKey="Hugenholtz P">P Hugenholtz</name>
</author>
<author>
<name sortKey="Larsen, N" uniqKey="Larsen N">N Larsen</name>
</author>
<author>
<name sortKey="Rojas, M" uniqKey="Rojas M">M Rojas</name>
</author>
<author>
<name sortKey="Brodie, El" uniqKey="Brodie E">EL Brodie</name>
</author>
<author>
<name sortKey="Keller, K" uniqKey="Keller K">K Keller</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Rehurek, R" uniqKey="Rehurek R">R Rehurek</name>
</author>
<author>
<name sortKey="Sojka, P" uniqKey="Sojka P">P Sojka</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pennington, J" uniqKey="Pennington J">J Pennington</name>
</author>
<author>
<name sortKey="Socher, R" uniqKey="Socher R">R Socher</name>
</author>
<author>
<name sortKey="Manning, Cd" uniqKey="Manning C">CD Manning</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, Q" uniqKey="Wang Q">Q Wang</name>
</author>
<author>
<name sortKey="Garrity, Gm" uniqKey="Garrity G">GM Garrity</name>
</author>
<author>
<name sortKey="Tiedje, Jm" uniqKey="Tiedje J">JM Tiedje</name>
</author>
<author>
<name sortKey="Cole, Jr" uniqKey="Cole J">JR Cole</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wright, Es" uniqKey="Wright E">ES Wright</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS Comput Biol</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS Comput. Biol</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">ploscomp</journal-id>
<journal-title-group>
<journal-title>PLoS Computational Biology</journal-title>
</journal-title-group>
<issn pub-type="ppub">1553-734X</issn>
<issn pub-type="epub">1553-7358</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, CA USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">30807567</article-id>
<article-id pub-id-type="pmc">6407789</article-id>
<article-id pub-id-type="publisher-id">PCOMPBIOL-D-18-00713</article-id>
<article-id pub-id-type="doi">10.1371/journal.pcbi.1006721</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
<subj-group>
<subject>Sequence Alignment</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Computational Techniques</subject>
<subj-group>
<subject>Split-Decomposition Method</subject>
<subj-group>
<subject>Multiple Alignment Calculation</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Database and informatics methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence analysis</subject>
<subj-group>
<subject>DNA sequence analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Biological Databases</subject>
<subj-group>
<subject>Sequence Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
<subj-group>
<subject>Sequence Databases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Database and Informatics Methods</subject>
<subj-group>
<subject>Bioinformatics</subject>
<subj-group>
<subject>Sequence Analysis</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Molecular Biology</subject>
<subj-group>
<subject>Molecular Biology Techniques</subject>
<subj-group>
<subject>Sequencing Techniques</subject>
<subj-group>
<subject>Nucleotide Sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and Analysis Methods</subject>
<subj-group>
<subject>Molecular Biology Techniques</subject>
<subj-group>
<subject>Sequencing Techniques</subject>
<subj-group>
<subject>Nucleotide Sequencing</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Nucleic acids</subject>
<subj-group>
<subject>RNA</subject>
<subj-group>
<subject>Non-coding RNA</subject>
<subj-group>
<subject>Ribosomal RNA</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Ribosomes</subject>
<subj-group>
<subject>Ribosomal RNA</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Cell biology</subject>
<subj-group>
<subject>Cellular structures and organelles</subject>
<subj-group>
<subject>Ribosomes</subject>
<subj-group>
<subject>Ribosomal RNA</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and Life Sciences</subject>
<subj-group>
<subject>Taxonomy</subject>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Computer and Information Sciences</subject>
<subj-group>
<subject>Data Management</subject>
<subj-group>
<subject>Taxonomy</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses</article-title>
<alt-title alt-title-type="running-head">16S rRNA sequence embeddings</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-0568-298X</contrib-id>
<name>
<surname>Woloszynek</surname>
<given-names>Stephen</given-names>
</name>
<role content-type="http://credit.casrai.org/">Conceptualization</role>
<role content-type="http://credit.casrai.org/">Data curation</role>
<role content-type="http://credit.casrai.org/">Formal analysis</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Supervision</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Visualization</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review &amp; editing</role>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhao</surname>
<given-names>Zhengqiao</given-names>
</name>
<role content-type="http://credit.casrai.org/">Formal analysis</role>
<role content-type="http://credit.casrai.org/">Methodology</role>
<role content-type="http://credit.casrai.org/">Validation</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<role content-type="http://credit.casrai.org/">Writing – review &amp; editing</role>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chen</surname>
<given-names>Jian</given-names>
</name>
<role content-type="http://credit.casrai.org/">Formal analysis</role>
<role content-type="http://credit.casrai.org/">Writing – original draft</role>
<xref ref-type="aff" rid="aff002">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id authenticated="true" contrib-id-type="orcid">http://orcid.org/0000-0003-1763-5750</contrib-id>
<name>
<surname>Rosen</surname>
<given-names>Gail L.</given-names>
</name>
<role content-type="http://credit.casrai.org/">Funding acquisition</role>
<role content-type="http://credit.casrai.org/">Resources</role>
<role content-type="http://credit.casrai.org/">Supervision</role>
<role content-type="http://credit.casrai.org/">Writing – review &amp; editing</role>
<xref ref-type="aff" rid="aff001">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="cor001">*</xref>
</contrib>
</contrib-group>
<aff id="aff001">
<label>1</label>
<addr-line>Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America</addr-line>
</aff>
<aff id="aff002">
<label>2</label>
<addr-line>Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Langille</surname>
<given-names>Morgan</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>DAL, CANADA</addr-line>
</aff>
<author-notes>
<fn fn-type="COI-statement" id="coi001">
<p>The authors have declared that no competing interests exist.</p>
</fn>
<corresp id="cor001">* E-mail:
<email>glr26@drexel.edu</email>
</corresp>
</author-notes>
<pub-date pub-type="collection">
<month>2</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="epub">
<day>26</day>
<month>2</month>
<year>2019</year>
</pub-date>
<volume>15</volume>
<issue>2</issue>
<elocation-id>e1006721</elocation-id>
<history>
<date date-type="received">
<day>4</day>
<month>5</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>12</month>
<year>2018</year>
</date>
</history>
<permissions>
<copyright-statement>© 2019 Woloszynek et al</copyright-statement>
<copyright-year>2019</copyright-year>
<copyright-holder>Woloszynek et al</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>This is an open access article distributed under the terms of the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="pcbi.1006721.pdf"></self-uri>
<abstract>
<p>Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure <italic>in situ</italic>. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed <italic>k</italic>-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as <italic>k</italic>-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the <italic>k</italic>-mer embedding space captured distinct <italic>k</italic>-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.</p>
</abstract>
<abstract abstract-type="summary">
<title>Author summary</title>
<p>Improvements in the way genomes are sequenced have led to an abundance of microbiome data. With the right approaches, researchers use these data to thoroughly characterize how microbes interact with each other and their host, but sequencing data is of a form (sequences of letters) not ideal for many data analysis approaches. We therefore present an approach to transform sequencing data into arrays of numbers that can capture interesting qualities of the data at the sub-sequence, full-sequence, and sample levels. This allows us to measure the importance of certain microbial sequences with respect to the type of microbe and the condition of the host. Also, representing sequences in this way improves our ability to use other complicated modeling approaches. Using microbiome data from human samples, we show that our numeric representations captured differences between various types of microbes, as well as differences in the body site location from which the samples were collected.</p>
</abstract>
<funding-group>
<funding-statement>The authors received no specific funding for this work.</funding-statement>
</funding-group>
<counts>
<fig-count count="9"></fig-count>
<table-count count="2"></table-count>
<page-count count="25"></page-count>
</counts>
<custom-meta-group>
<custom-meta>
<meta-name>PLOS Publication Stage</meta-name>
<meta-value>vor-update-to-uncorrected-proof</meta-value>
</custom-meta>
<custom-meta>
<meta-name>Publication Update</meta-name>
<meta-value>2019-03-08</meta-value>
</custom-meta>
<custom-meta id="data-availability">
<meta-name>Data Availability</meta-name>
<meta-value>All code used to generate figures, perform analyses, and prepare data is stored on a github repository (<ext-link ext-link-type="uri" xlink:href="https://github.com/EESI/microbiome_embeddings">https://github.com/EESI/microbiome_embeddings</ext-link>).</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
<notes>
<title>Data Availability</title>
<p>All code used to generate figures, perform analyses, and prepare data is stored on a github repository (<ext-link ext-link-type="uri" xlink:href="https://github.com/EESI/microbiome_embeddings">https://github.com/EESI/microbiome_embeddings</ext-link>).</p>
</notes>
</front>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F88 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000F88 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:6407789
   |texte=   16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:30807567" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021