MERS exploration server


Internal identifier: 0002850 (Pmc/Corpus); previous: 0002849; next: 0002851



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network</title>
<author>
<name sortKey="Wen, Jianghui" sort="Wen, Jianghui" uniqKey="Wen J" first="Jianghui" last="Wen">Jianghui Wen</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Yeshu" sort="Liu, Yeshu" uniqKey="Liu Y" first="Yeshu" last="Liu">Yeshu Liu</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shi, Yu" sort="Shi, Yu" uniqKey="Shi Y" first="Yu" last="Shi">Yu Shi</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huang, Haoran" sort="Huang, Haoran" uniqKey="Huang H" first="Haoran" last="Huang">Haoran Huang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Deng, Bing" sort="Deng, Bing" uniqKey="Deng B" first="Bing" last="Deng">Bing Deng</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.495882.a</institution-id>
<institution>Wuhan Academy of Agricultural Sciences,</institution>
</institution-wrap>
Wuhan, 430208 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xiao, Xinping" sort="Xiao, Xinping" uniqKey="Xiao X" first="Xinping" last="Xiao">Xinping Xiao</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">31519146</idno>
<idno type="pmc">6743109</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6743109</idno>
<idno type="RBID">PMC:6743109</idno>
<idno type="doi">10.1186/s12859-019-3039-3</idno>
<date when="2019">2019</date>
<idno type="wicri:Area/Pmc/Corpus">000285</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000285</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network</title>
<author>
<name sortKey="Wen, Jianghui" sort="Wen, Jianghui" uniqKey="Wen J" first="Jianghui" last="Wen">Jianghui Wen</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Yeshu" sort="Liu, Yeshu" uniqKey="Liu Y" first="Yeshu" last="Liu">Yeshu Liu</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Shi, Yu" sort="Shi, Yu" uniqKey="Shi Y" first="Yu" last="Shi">Yu Shi</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Huang, Haoran" sort="Huang, Haoran" uniqKey="Huang H" first="Haoran" last="Huang">Haoran Huang</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Deng, Bing" sort="Deng, Bing" uniqKey="Deng B" first="Bing" last="Deng">Bing Deng</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.495882.a</institution-id>
<institution>Wuhan Academy of Agricultural Sciences,</institution>
</institution-wrap>
Wuhan, 430208 People’s Republic of China</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xiao, Xinping" sort="Xiao, Xinping" uniqKey="Xiao X" first="Xinping" last="Xiao">Xinping Xiao</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2019">2019</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p id="Par1">Long-chain non-coding RNA (lncRNA) is closely related to many biological activities. Since its sequence structure is similar to that of messenger RNA (mRNA), it is difficult to distinguish between the two based only on sequence biometrics. Therefore, it is particularly important to construct a model that can effectively identify lncRNA and mRNA.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">First, the difference in the k-mer frequency distribution between lncRNA and mRNA sequences is considered in this paper, and they are transformed into the k-mer frequency matrix. Moreover, k-mers with more species are screened by relative entropy. The classification model of the lncRNA and mRNA sequences is then proposed by inputting the k-mer frequency matrix and training the convolutional neural network. Finally, the optimal k-mer combination of the classification model is determined and compared with other machine learning methods in humans, mice and chickens. The results indicate that the proposed model has the highest classification accuracy. Furthermore, the recognition ability of this model is verified on a single sequence.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par3">We established a classification model for lncRNA and mRNA based on k-mers and the convolutional neural network. The classification accuracy of the model with 1-mers, 2-mers and 3-mers was the highest, with an accuracy of 0.9872 in humans, 0.8797 in mice and 0.9963 in chickens, which is better than those of the random forest, logistic regression, decision tree and support vector machine.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Djebali, S" uniqKey="Djebali S">S Djebali</name>
</author>
<author>
<name sortKey="Davis, Ca" uniqKey="Davis C">CA Davis</name>
</author>
<author>
<name sortKey="Merkel, A" uniqKey="Merkel A">A Merkel</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wucher, V" uniqKey="Wucher V">V Wucher</name>
</author>
<author>
<name sortKey="Legeai, F" uniqKey="Legeai F">F Legeai</name>
</author>
<author>
<name sortKey="Hedan, B" uniqKey="Hedan B">B Hédan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Han, Sy" uniqKey="Han S">SY Han</name>
</author>
<author>
<name sortKey="Liang, Yc" uniqKey="Liang Y">YC Liang</name>
</author>
<author>
<name sortKey="Li, Y" uniqKey="Li Y">Y Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Ws" uniqKey="Li W">WS Li</name>
</author>
<author>
<name sortKey="Xiao, Xw" uniqKey="Xiao X">XW Xiao</name>
</author>
<author>
<name sortKey="Su, H" uniqKey="Su H">H Su</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Caley, Dp" uniqKey="Caley D">DP Caley</name>
</author>
<author>
<name sortKey="Pink, Rc" uniqKey="Pink R">RC Pink</name>
</author>
<author>
<name sortKey="Truillano, D" uniqKey="Truillano D">D Truillano</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nagano, T" uniqKey="Nagano T">T Nagano</name>
</author>
<author>
<name sortKey="Mitchell, Ja" uniqKey="Mitchell J">JA Mitchell</name>
</author>
<author>
<name sortKey="Sanz, La" uniqKey="Sanz L">LA Sanz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author>
<name sortKey="Arai, S" uniqKey="Arai S">S Arai</name>
</author>
<author>
<name sortKey="Song, X" uniqKey="Song X">X Song</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wapinski, O" uniqKey="Wapinski O">O Wapinski</name>
</author>
<author>
<name sortKey="Chang, Hy" uniqKey="Chang H">HY Chang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kong, L" uniqKey="Kong L">L Kong</name>
</author>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author>
<name sortKey="Ye, Zq" uniqKey="Ye Z">ZQ Ye</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sun, L" uniqKey="Sun L">L Sun</name>
</author>
<author>
<name sortKey="Luo, H" uniqKey="Luo H">H Luo</name>
</author>
<author>
<name sortKey="Bu, D" uniqKey="Bu D">D Bu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Dang, Hx" uniqKey="Dang H">HX Dang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mariner, Pd" uniqKey="Mariner P">PD Mariner</name>
</author>
<author>
<name sortKey="Walters, Rd" uniqKey="Walters R">RD Walters</name>
</author>
<author>
<name sortKey="Espinoza, Ca" uniqKey="Espinoza C">CA Espinoza</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lin, Mf" uniqKey="Lin M">MF Lin</name>
</author>
<author>
<name sortKey="Jungreis, I" uniqKey="Jungreis I">I Jungreis</name>
</author>
<author>
<name sortKey="Kellis, M" uniqKey="Kellis M">M Kellis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lertampaiporn, S" uniqKey="Lertampaiporn S">S Lertampaiporn</name>
</author>
<author>
<name sortKey="Thammarongtham, C" uniqKey="Thammarongtham C">C Thammarongtham</name>
</author>
<author>
<name sortKey="Nukoolkit, C" uniqKey="Nukoolkit C">C Nukoolkit</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, M" uniqKey="Wei M">M Wei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qaisar, A" uniqKey="Qaisar A">A Qaisar</name>
</author>
<author>
<name sortKey="Syed, R" uniqKey="Syed R">R Syed</name>
</author>
<author>
<name sortKey="Azizuddin, B" uniqKey="Azizuddin B">B Azizuddin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
<author>
<name sortKey="Xu, X" uniqKey="Xu X">X Xu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, Y" uniqKey="Chen Y">Y Chen</name>
</author>
<author>
<name sortKey="Wang, L" uniqKey="Wang L">L Wang</name>
</author>
<author>
<name sortKey="Li, F" uniqKey="Li F">F Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zeng, H" uniqKey="Zeng H">H Zeng</name>
</author>
<author>
<name sortKey="Edwards, Md" uniqKey="Edwards M">MD Edwards</name>
</author>
<author>
<name sortKey="Liu, G" uniqKey="Liu G">G Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Alipanahi, B" uniqKey="Alipanahi B">B Alipanahi</name>
</author>
<author>
<name sortKey="Delong, A" uniqKey="Delong A">A Delong</name>
</author>
<author>
<name sortKey="Weirauch, Mt" uniqKey="Weirauch M">MT Weirauch</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Q" uniqKey="Zhang Q">Q Zhang</name>
</author>
<author>
<name sortKey="Zhu, L" uniqKey="Zhu L">L Zhu</name>
</author>
<author>
<name sortKey="Huang, Ds" uniqKey="Huang D">DS Huang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chuai, Gh" uniqKey="Chuai G">GH Chuai</name>
</author>
<author>
<name sortKey="Ma, Hh" uniqKey="Ma H">HH Ma</name>
</author>
<author>
<name sortKey="Yan, Jf" uniqKey="Yan J">JF Yan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gasri Plotnitsky, L" uniqKey="Gasri Plotnitsky L">L Gasri-Plotnitsky</name>
</author>
<author>
<name sortKey="Ovadia, A" uniqKey="Ovadia A">A Ovadia</name>
</author>
<author>
<name sortKey="Shamalov, K" uniqKey="Shamalov K">K Shamalov</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chou, Kc" uniqKey="Chou K">KC Chou</name>
</author>
<author>
<name sortKey="Shen, Hb" uniqKey="Shen H">HB Shen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chou, Kc" uniqKey="Chou K">KC Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chou, Kc" uniqKey="Chou K">KC Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, X" uniqKey="Chen X">X Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">BMC Bioinformatics</journal-id>
<journal-id journal-id-type="iso-abbrev">BMC Bioinformatics</journal-id>
<journal-title-group>
<journal-title>BMC Bioinformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1471-2105</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">31519146</article-id>
<article-id pub-id-type="pmc">6743109</article-id>
<article-id pub-id-type="publisher-id">3039</article-id>
<article-id pub-id-type="doi">10.1186/s12859-019-3039-3</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methodology Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A classification model for lncRNA and mRNA based on k-mers and a convolutional neural network</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Wen</surname>
<given-names>Jianghui</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Yeshu</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Shi</surname>
<given-names>Yu</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Huang</surname>
<given-names>Haoran</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Deng</surname>
<given-names>Bing</given-names>
</name>
<address>
<email>dengbing0906@163.com</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Xiao</surname>
<given-names>Xinping</given-names>
</name>
<address>
<email>xiaoxp@whut.edu.cn</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9291 3229</institution-id>
<institution-id institution-id-type="GRID">grid.162110.5</institution-id>
<institution>School of Science,</institution>
<institution>Wuhan University of Technology,</institution>
</institution-wrap>
Wuhan, 430070 People’s Republic of China</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="GRID">grid.495882.a</institution-id>
<institution>Wuhan Academy of Agricultural Sciences,</institution>
</institution-wrap>
Wuhan, 430208 People’s Republic of China</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>13</day>
<month>9</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>13</day>
<month>9</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>20</volume>
<elocation-id>469</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>12</month>
<year>2018</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>8</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s). 2019</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p id="Par1">Long-chain non-coding RNA (lncRNA) is closely related to many biological activities. Since its sequence structure is similar to that of messenger RNA (mRNA), it is difficult to distinguish between the two based only on sequence biometrics. Therefore, it is particularly important to construct a model that can effectively identify lncRNA and mRNA.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par2">First, the difference in the k-mer frequency distribution between lncRNA and mRNA sequences is considered in this paper, and they are transformed into the k-mer frequency matrix. Moreover, k-mers with more species are screened by relative entropy. The classification model of the lncRNA and mRNA sequences is then proposed by inputting the k-mer frequency matrix and training the convolutional neural network. Finally, the optimal k-mer combination of the classification model is determined and compared with other machine learning methods in humans, mice and chickens. The results indicate that the proposed model has the highest classification accuracy. Furthermore, the recognition ability of this model is verified on a single sequence.</p>
</sec>
<sec>
<title>Conclusion</title>
<p id="Par3">We established a classification model for lncRNA and mRNA based on k-mers and the convolutional neural network. The classification accuracy of the model with 1-mers, 2-mers and 3-mers was the highest, with an accuracy of 0.9872 in humans, 0.8797 in mice and 0.9963 in chickens, which is better than those of the random forest, logistic regression, decision tree and support vector machine.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>lncRNA</kwd>
<kwd>mRNA</kwd>
<kwd>K-mers</kwd>
<kwd>Relative entropy</kwd>
<kwd>Convolutional neural network</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100001809</institution-id>
<institution>National Natural Science Foundation of China</institution>
</institution-wrap>
</funding-source>
<award-id>61403288</award-id>
<award-id>71871174</award-id>
<principal-award-recipient>
<name>
<surname>Wen</surname>
<given-names>Jianghui</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>Xinping</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<funding-group>
<award-group>
<funding-source>
<institution>Natural Science Foundation of Hubei Province, China</institution>
</funding-source>
<award-id>2019cfb589</award-id>
<principal-award-recipient>
<name>
<surname>Deng</surname>
<given-names>Bing</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p id="Par16">Transcription of the genome includes messenger RNAs (mRNAs), small RNAs (miRNAs, snRNAs) and non-coding RNAs (ncRNAs) [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR2">2</xref>
]. LncRNA is a kind of noncoding RNA with a length exceeding 200 nucleotides [
<xref ref-type="bibr" rid="CR3">3</xref>
]. There has been growing interest in long non-coding RNAs [
<xref ref-type="bibr" rid="CR4">4</xref>
]. Current studies demonstrate that lncRNA sequences play a role in two main aspects of the organism. On the one hand, they perform vital biological functions at many stages, such as transcription and the regulation of life processes. For example, lncRNAs can participate in the regulation of gene expression at three levels: epigenetic regulation, transcriptional regulation and post-transcriptional regulation [
<xref ref-type="bibr" rid="CR5">5</xref>
], and some lncRNAs can bind to specific chromatin-related sites through chromatin remodelling, resulting in the expression of silenced related genes [
<xref ref-type="bibr" rid="CR6">6</xref>
,
<xref ref-type="bibr" rid="CR7">7</xref>
]. On the other hand, lncRNAs have direct or indirect links with some diseases in humans, such as lung cancer, prostate cancer, Alzheimer’s disease, Prader-Willi syndrome, agitation, etc. [
<xref ref-type="bibr" rid="CR4">4</xref>
]. Therefore, the identification and cataloguing of lncRNAs will help researchers further study and explore their functions at the molecular level [
<xref ref-type="bibr" rid="CR8">8</xref>
].</p>
<p id="Par17">However, thus far, only a small number of lncRNAs have been included in non-coding RNA databases. Additionally, in the existing databases, only a small number of lncRNA functions have been thoroughly studied and annotated. An even greater difficulty is that functional studies of an lncRNA sequence presuppose the ability to determine whether the sequence is an lncRNA in the first place, which is the main obstacle in biological and bioinformatics research. Since lncRNAs and mRNAs share many similarities in sequence structure, the task of identifying lncRNA sequences becomes even more challenging. Accordingly, designing a model that can accurately distinguish lncRNA and mRNA sequences in the large amount of sequence data obtained by high-throughput sequencing is an important research topic in biology.</p>
<p id="Par18">At present, research classifies coding and non-coding RNA mainly from three angles: first, discrimination by the length of the open reading frame of coding and non-coding sequences; second, discrimination by comparing the similarity between the sequence and known protein sequences using comparative genomics methods; and third, prediction based on the conservation of the RNA secondary structure. However, each of these three methods has its own merits and demerits, and it is difficult to obtain accurate sequence classification results based on only one of them. To solve this problem, some scholars have constructed models and software for classifying mRNAs and lncRNAs by extracting non-coding features from lncRNA sequences. For example, the Bioinformatics Center of Peking University developed an online lncRNA identification tool, CPC (Coding Potential Calculator) [
<xref ref-type="bibr" rid="CR9">9</xref>
], which has been widely used in many fields, such as sequence alignment, disease research and evolutionary analysis. Its principle is to extract six features, including the ratio of the open reading frame length to the sequence length, the integrity of the open reading frame and the prediction reliability score of the open reading frame, feed those features into a support vector machine (SVM) for training, and thereby develop a prediction model for non-coding RNA. Sun et al. [
<xref ref-type="bibr" rid="CR10">10</xref>
] proposed a method named CNCI (Coding-Non-Coding Index) based on Adjoining Nucleotide Triplets (ANT). Its framework consists of a scoring matrix and a classification model. First, the species categories of the sample data were determined, and the probability of occurrence of each pair of adjacent triplets in the coding, non-coding and intergenic regions was counted to construct three ANT probability matrices. Then, against the reference, the log-ratios of the ANT probability matrices of the coding and non-coding regions were calculated to obtain the scoring matrix of the CNCI algorithm. Further, the CNCI scoring matrix was used to determine the Most-Like Coding Domain Sequence (MLCDS), and five different features were then extracted from each MLCDS for classification. Dang [
<xref ref-type="bibr" rid="CR11">11</xref>
] selected three characteristics from the perspective of the open reading frame, three characteristics of the integrated sequence secondary structure and two characteristics of protein similarity, and summarized seven combinations of the three types of features. Her lncRNA prediction model is suitable for different data sources.</p>
<p id="Par19">The CSF prediction software proposed by Lin et al. [
<xref ref-type="bibr" rid="CR12">12</xref>
] mainly aimed to identify lncRNAs by calculating the frequency of codon substitution in the target sequence. Based on the CSF model, the evolution information of the alignment sequence was introduced, and they developed the Phylo CSF recognition model [
<xref ref-type="bibr" rid="CR13">13</xref>
]. Wucher et al. developed the FEELnc program, an lncRNA and mRNA recognition tool, which was a random forest-based classification model trained using features such as open reading frames [
<xref ref-type="bibr" rid="CR2">2</xref>
]. In 2014, Lertampaiporn et al. developed a hybrid model based on logistic regression and random forests to distinguish short non-coding RNA sequences from lncRNA sequences. The model synthesized five combined features, SCORE, which improved the lncRNA classification performance [
<xref ref-type="bibr" rid="CR14">14</xref>
].</p>
<p id="Par20">To summarize, most of the available methods for identifying lncRNA among mRNA sequences are based on the biological characteristics of the sequences. However, the lncRNA sequence may contain some sequences that can overlap with the coding regions of mRNAs [
<xref ref-type="bibr" rid="CR2">2</xref>
]. Thus, the recognition of lncRNA sequences is more complex than the recognition of mRNA sequences when using existing methods. To avoid the use of sequence biological characteristics to establish a classification model of the sequence, Wei proposed an lncRNA and mRNA classification model based on the k-mer [
<xref ref-type="bibr" rid="CR15">15</xref>
]. This model used the maximum entropy algorithm to screen k-mers and a support vector machine for classification; however, it exhibited great computational complexity and a high computational cost. In addition, when using conventional machine learning algorithms such as support vector machines (SVMs), logistic regression, decision trees, neural networks (NNs), Bayesian networks (BNs), genetic algorithms (GAs) and hidden Markov models (HMMs), the pre-processing of raw input data and the selection of sequence features require domain-expert knowledge, and parameters must be fine-tuned to increase accuracy [
<xref ref-type="bibr" rid="CR16">16</xref>
]. Therefore, we propose a model that classifies lncRNAs and mRNAs effectively without relying on sequencing quality or the biological structural characteristics of the sequence, while also avoiding a large number of calculations.</p>
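The relative-entropy screening mentioned in the abstract can be illustrated with a short Python sketch. This is not the authors' implementation; `kmer_freqs` and `relative_entropy` are hypothetical names, and the idea shown is simply that k-mer sets whose frequency distributions diverge more between the two classes are more discriminative:

```python
import math

def kmer_freqs(seq, k):
    """Relative frequency of each k-mer observed in a sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}

def relative_entropy(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q); eps guards log(0) for unseen k-mers."""
    support = set(p) | set(q)
    return sum(p.get(m, eps) * math.log(p.get(m, eps) / q.get(m, eps))
               for m in support)
```

In a screening step, one would compare the class-wise frequency distributions of lncRNA and mRNA training sets and retain the k-mer sizes (or individual k-mers) with the largest divergence.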
<p id="Par21">Since the Convolutional Neural Network (CNN) model can self-learn the characteristics of the sequence through continuous training without artificial intervention and can efficiently process large amounts of data, no domain-expert knowledge or parameter fine-tuning is needed to increase accuracy [
<xref ref-type="bibr" rid="CR17">17</xref>
,
<xref ref-type="bibr" rid="CR18">18</xref>
]. It has been used to predict DNA-protein binding sites [
<xref ref-type="bibr" rid="CR19">19</xref>
] and to predict the specificity of DNA and RNA binding proteins [
<xref ref-type="bibr" rid="CR20">20</xref>
]. Zhang et al. developed two methods for predicting DNA-protein binding, using a high-order convolutional neural network architecture and a weakly supervised convolutional neural network architecture [
<xref ref-type="bibr" rid="CR21">21</xref>
,
<xref ref-type="bibr" rid="CR22">22</xref>
]. Transcription factor prediction using ChIP-seq data [
<xref ref-type="bibr" rid="CR23">23</xref>
] and CRISPR guide RNA design [
<xref ref-type="bibr" rid="CR24">24</xref>
] can also be conducted effectively using CNNs. However, whether a CNN can be used effectively to classify lncRNAs and mRNAs remains unknown.</p>
<p id="Par22">In this study, we intend to introduce the convolutional neural network model to establish a classification model of lncRNAs and mRNAs. The content of this paper is arranged as follows. First, the k-mer frequency information for lncRNA and mRNA sequences is statistically analysed. Second, we construct the classification model of lncRNA and mRNA sequences by convolutional neural network taking the k-mer frequency matrix as input. Third, we determine the optimal k-mer combination of the model, compare it with those of other machine learning methods and verify the recognition ability of identifying a single sequence.</p>
</sec>
<sec id="Sec2">
<title>Results and discussion</title>
<sec id="Sec3">
<title>Training data and testing data</title>
<p id="Par23">We download human lncRNA sequence data and mRNA sequence data from the GENCODE database, gencode.v 26. The 10,000 sequences data are randomly selected from the two sample sets each time, i.e., 10,000 lncRNA sequences and 10,000 mRNA sequences, of which 8000 sequences are selected as training samples and the remaining 2000 sequences are used as test samples. We perform 10 random selections to verify the contingency impact of the randomly selected data training model.</p>
<p id="Par24">The frequency means of 2-mers in the 10 sets of sequences are calculated, and the line graphs are shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
a and b. In Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
, the lncRNA mean line graphs of the 10 sets of data almost coincide; only the AA frequency of the first set differs slightly from the other groups. The average AA frequency of the first set is 0.069, that of the second, third, fourth, sixth, eighth, ninth, and tenth sets is 0.067, and that of the fifth set is 0.068. The difference between the data sets therefore does not exceed 0.002, so the error is small. From Fig.
<xref rid="Fig1" ref-type="fig">1</xref>
b, the mRNA mean line graphs of the 10 sets of data also mostly overlap, and only the four k-mers AA, AT, GC, and GG differ. However, the ranges of the frequency means of AA, AT, GC, and GG across the 10 sets of data are only approximately 0.0048, 0.0048, 0.0046, and 0.0036, respectively, so the differences are not large. Therefore, the randomness of the data extraction does not greatly affect the calculation results of the model.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>The 2-mer frequency mean line graph.
<bold>a</bold>
The 2-mer frequency mean line graph of lncRNA.
<bold>b</bold>
The 2-mer frequency mean line graph of mRNA</p>
</caption>
<graphic xlink:href="12859_2019_3039_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
</sec>
<sec id="Sec4">
<title>Determination of k-mer parameters of the lncRNA classification model</title>
<p id="Par25">First, the calculation is divided into two steps. In the first step, lncRNA sequences ranging from 250 nt to 3500 nt and mRNA sequences ranging from 200 nt to 4000 nt are selected. The k-mer subsequence is extracted using the k-mer algorithm. For the k-mers with larger values, the relative entropy is used to select the features, and the accuracy of the model before and after the screening is compared. Finally, the frequency of the k-mer subsequence in each sequence is counted, and the frequency matrix is constructed. In the second step, the convolutional neural network model is trained using the constructed frequency matrix to obtain the classification results of the model. When
<italic>k</italic>
takes different values, the results are compared to obtain the optimal parameters of the classification model.</p>
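The first step can be sketched in Python; `kmer_frequencies` is a hypothetical helper name, and the toy sequences stand in for the real lncRNA/mRNA data.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k):
    """Relative frequency of every possible k-mer (4**k of them) in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[m] / total for m in kmers]

# Frequency matrix: one row per sequence, one column per k-mer (here k = 3).
sequences = ["GCCAACGCCAGGCCGACCAGTTC", "ATATATATCGCGCGCG"]  # toy examples
matrix = [kmer_frequencies(s, 3) for s in sequences]
print(len(matrix), len(matrix[0]))  # 2 rows, 64 columns
```

Each row sums to 1 because every length-k window over the A/C/G/T alphabet is counted; in the second step these rows are reshaped into 2-D matrices and fed to the network.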
</sec>
<sec id="Sec5">
<title>Determination of optimal k-mer combination in the lncRNA classification model</title>
<p id="Par26">Based on the statistical analysis, we randomly select 10,000 lncRNA sequence data ranging from 250 nt to 3000 nt in the lncRNA dataset downloaded from the GENCODE database, and we also randomly select 10,000 mRNA sequence data ranging from 200 nt to 4000 nt in length.</p>
<p id="Par27">Next, we build an lncRNA and mRNA classification model. The first layer of the convolutional neural network uses 32 convolution kernels of 3 × 3, selects the Relu activation function, and the periphery of the k-mer frequency matrix is padded with “0” to ensure a constant size of the matrix before and after the convolution calculation. The second layer is still the convolutional layer, with 64 convolution kernels of 3 × 3, and the activation function is still the Relu function. The third layer is the largest pooling layer, and the size of the pooling area is 2 × 2. The partial neuron connections with a probability of 0.25 are omitted before the fully connected layer to prevent overfitting. The last layer is the fully connected layer. There are 128 neurons in the fully connected layer. After the whole layer is connected, the probability of connections between the omitted neurons is 0.5. Finally, the SoftMax function is used to obtain the classification result. The loss function in the model training process selects the cross entropy loss function, and the optimizer is Adadelta.</p>
<p id="Par28">To determine the most differential k-mers in lncRNA and mRNA sequences and to maximize the accuracy of the model k-mers, we select k-mers with different
<italic>k</italic>
values. The established lncRNA and mRNA classification models are used to learn autonomously. Finally, the classification accuracy, model accuracy, recall rate and
<italic>F</italic>
<sub>1</sub>
score of the classification model are compared when different
<italic>k</italic>
values are compared.</p>
<p id="Par29">We take a single
<italic>k</italic>
value, which is 3, 4, 5, and 6. The specific results are shown in Table 
<xref rid="Tab1" ref-type="table">1</xref>
.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Model classification accuracy for individual
<italic>k</italic>
values</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>
<italic>k</italic>
value</th>
<th>number of k-mers</th>
<th>matrix form</th>
<th>model accuracy</th>
<th>precision rate (P)</th>
<th>recall rate (R)</th>
<th>
<italic>F</italic>
<sub>1</sub>
score</th>
<th>calculating time (s/epoch)</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>64</td>
<td>8 × 8</td>
<td>0.7508</td>
<td>0.81</td>
<td>0.79</td>
<td>0.79</td>
<td>5</td>
</tr>
<tr>
<td>4</td>
<td>256</td>
<td>16 × 16</td>
<td>0.7610</td>
<td>0.85</td>
<td>0.83</td>
<td>0.83</td>
<td>20</td>
</tr>
<tr>
<td>5</td>
<td>1024</td>
<td>32 × 32</td>
<td>0.7565</td>
<td>0.93</td>
<td>0.92</td>
<td>0.92</td>
<td>95</td>
</tr>
<tr>
<td>6</td>
<td>4096</td>
<td>64 × 64</td>
<td>0.7748</td>
<td>0.87</td>
<td>0.85</td>
<td>0.84</td>
<td>855</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par30">It can be seen from Table
<xref rid="Tab1" ref-type="table">1</xref>
that the classification effect of the model is different when different
<italic>k</italic>
values are taken. As the
<italic>k</italic>
value increases, the number of k-mers increases and the model accuracy generally increases. However, when
<italic>k</italic>
 = 5, then the accuracy of the classification model is slightly lower than
<italic>k</italic>
 = 4, but the difference does not exceed 0.01, and the difference is not large. When there are too many types of k-mers, the frequency of each k-mer will also decrease, and even the frequency of most k-mers will be 0, so each k-mer will carry less difference information. When
<italic>k</italic>
is 6, the accuracy is the highest, but it is only 0.7748. This classification is not ideal, and its time complexity is the highest because it requires 855 s to calculate a time of 1 epoch. However, when
<italic>k</italic>
is equal to 3, the accuracy is only approximately 0.024 lower than it is with
<italic>k</italic>
 = 6, but it takes only 5 s to calculate 1 epoch. Therefore, we attempt to combine these individual k-mers in pairs and analyse the results to determine whether this attempt is reasonable. The specific calculation results are shown in Table 
<xref rid="Tab2" ref-type="table">2</xref>
.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Model classification accuracy for combinations of two
<italic>k</italic>
values</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>
<italic>k</italic>
value</th>
<th>number of k-mers</th>
<th>matrix form</th>
<th>model accuracy</th>
<th>precision rate (P)</th>
<th>recall rate (R)</th>
<th>
<italic>F</italic>
<sub>1</sub>
score</th>
<th>calculating time (s/epoch)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 + 3</td>
<td>68</td>
<td>17 × 4</td>
<td>0.9280</td>
<td>0.94</td>
<td>0.94</td>
<td>0.94</td>
<td>4</td>
</tr>
<tr>
<td>1 + 4</td>
<td>260</td>
<td>10 × 26</td>
<td>0.9600</td>
<td>0.98</td>
<td>0.98</td>
<td>0.98</td>
<td>32</td>
</tr>
<tr>
<td>1 + 5</td>
<td>1028</td>
<td>4 × 257</td>
<td>0.4995</td>
<td>0.50</td>
<td>0.50</td>
<td>0.36</td>
<td>43</td>
</tr>
<tr>
<td>2 + 3</td>
<td>80</td>
<td>8 × 10</td>
<td>0.9810</td>
<td>0.99</td>
<td>0.99</td>
<td>0.99</td>
<td>9</td>
</tr>
<tr>
<td>2 + 4</td>
<td>272</td>
<td>16 × 17</td>
<td>0.7838</td>
<td>0.87</td>
<td>0.86</td>
<td>0.86</td>
<td>37</td>
</tr>
<tr>
<td>2 + 5</td>
<td>1040</td>
<td>26 × 40</td>
<td>0.7672</td>
<td>0.91</td>
<td>0.90</td>
<td>0.90</td>
<td>180</td>
</tr>
<tr>
<td>3 + 4</td>
<td>320</td>
<td>16 × 20</td>
<td>0.7666</td>
<td>0.90</td>
<td>0.90</td>
<td>0.90</td>
<td>47</td>
</tr>
<tr>
<td>3 + 5</td>
<td>1088</td>
<td>32 × 34</td>
<td>0.7566</td>
<td>0.94</td>
<td>0.94</td>
<td>0.94</td>
<td>189</td>
</tr>
<tr>
<td>4 + 5</td>
<td>1280</td>
<td>32 × 40</td>
<td>0.7532</td>
<td>0.95</td>
<td>0.94</td>
<td>0.94</td>
<td>290</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par31">Since the time complexity is too high when
<italic>k</italic>
 = 6, if the combination calculation is performed, then the calculation time will be unsatisfactory, only taking
<italic>k</italic>
to be 1, 2, 3, 4, and 5 in pairs. From Table
<xref rid="Tab2" ref-type="table">2</xref>
, the recognition accuracy of the classification model is significantly improved when we combine the two k-mers, especially the combination of
<italic>k</italic>
 = 2 and
<italic>k</italic>
 = 3, with an accuracy reaching 0.9810. The second is a combination of
<italic>k</italic>
 = 1 and
<italic>k</italic>
= 4, with an accuracy of 0.9600. These results can be explained by the fact that combining k-mers strengthens the k-mer information of the sequence, allowing the model to extract more difference information through convolutional neural network self-learning. However, Table
<xref rid="Tab2" ref-type="table">2</xref>
also shows that, although combining k-mers can greatly improve the recognition accuracy of the model, not every combination improves the accuracy relative to the individual k-mers. For example, when
<italic>k</italic>
 = 5, as shown in Table
<xref rid="Tab1" ref-type="table">1</xref>
, the classification accuracy of the model is 0.7565, and when
<italic>k</italic>
 = 1 and
<italic>k</italic>
= 5 are combined, the accuracy is 0.4995; the accuracy of the model decreases rather than increases.</p>
<p id="Par32">Although the combined k-mers can greatly improve the accuracy of the classification model, this strategy also consumes more computation time than the model of a single k-mer. By comparison, the calculation time of the classification model is proportional to the number of k-mers. When the number of k-mers is larger, the calculation time consumed by the model is also greater. There are 80 k-mers in the combination of
<italic>k</italic>
 = 2 and
<italic>k</italic>
 = 3, and the calculation time consumed by 1 epoch is 9 s, which is second only to the combination of
<italic>k</italic>
 = 1 and
<italic>k</italic>
 = 3. The longest calculation time is the combination of
<italic>k</italic>
 = 4 and
<italic>k</italic>
= 5, which contains a total of 1280 k-mers; calculating 1 epoch requires 290 s, so training the model for 200 epochs would take approximately 16 h. It should be noted that the calculation time for the combination of
<italic>k</italic>
 = 1 and
<italic>k</italic>
= 3 is 4 s, which is faster than the calculation time for
<italic>k</italic>
 = 3 (5 s). This phenomenon is due to the presence of 68 k-mers in the combination of
<italic>k</italic>
 = 1 and
<italic>k</italic>
= 3, which are input as a 4 × 17 matrix and processed with a 2 × 2 convolution kernel for feature extraction, whereas the 64 3-mers are input as an 8 × 8 matrix and processed with a 3 × 3 convolution kernel. As a result, the calculation time for the combination of
<italic>k</italic>
 = 1 and
<italic>k</italic>
 = 3 is faster than the calculation time for
<italic>k</italic>
 = 3.</p>
<p id="Par33">Based on the information presented in Table
<xref rid="Tab2" ref-type="table">2</xref>
, the combination of
<italic>k</italic>
 = 2 and
<italic>k</italic>
= 3 provides the highest classification accuracy of the model. Furthermore, its computational time cost is relatively low. Therefore, we attempt to combine
<italic>k</italic>
 = 2 and
<italic>k</italic>
= 3 with other k-mers to verify whether a combination of k-mers with three
<italic>k</italic>
values can further improve the accuracy of the model beyond the combinations of two
<italic>k</italic>
values. The specific calculation results are shown in Table 
<xref rid="Tab3" ref-type="table">3</xref>
.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Model classification accuracy for combinations of three
<italic>k</italic>
values</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>
<italic>k</italic>
value</th>
<th>number of k-mers</th>
<th>matrix form</th>
<th>model accuracy</th>
<th>precision rate</th>
<th>recall rate</th>
<th>
<italic>F</italic>
<sub>1</sub>
score</th>
<th>calculating time (s/epoch)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1 + 2 + 3</td>
<td>84</td>
<td>7 × 12</td>
<td>0.9872</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>6</td>
</tr>
<tr>
<td>2 + 3 + 4</td>
<td>336</td>
<td>12 × 28</td>
<td>0.9738</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>57</td>
</tr>
<tr>
<td>2 + 3 + 5</td>
<td>1104</td>
<td>24 × 46</td>
<td>0.9798</td>
<td>1.00</td>
<td>1.00</td>
<td>1.00</td>
<td>217</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par34">Based on Table
<xref rid="Tab3" ref-type="table">3</xref>
, we find that the combination of
<italic>k</italic>
 = 1,
<italic>k</italic>
 = 2, and
<italic>k</italic>
= 3 further improves the accuracy of the model, whose recognition accuracy reaches 0.9872, as shown in Table
<xref rid="Tab3" ref-type="table">3</xref>
. Moreover, the calculation time is only 6 s, far less than that of the other k-mer combinations. Consequently, the k-mer combination of
<italic>k</italic>
= 1, 2, and 3 not only achieves the best model accuracy but also has a precision rate, recall rate and
<italic>F</italic>
<sub>1</sub>
score of 1.00, which indicates that the classification performance of the model is excellent. Based on the above results, we determine that the optimal k-mer combination for constructing the lncRNA and mRNA classification model is 1-mers, 2-mers and 3-mers.</p>
</sec>
<sec id="Sec6">
<title>Determination of the optimal combination of k-mers for the selected lncRNA classification model</title>
<p id="Par35">In the calculation process, we find that when the
<italic>k</italic>
value is greater than 4 (i.e., 5 or 6), a considerable portion of the k-mers exhibit a frequency of 0. The absence of most k-mer values may reduce the recognition accuracy of the model. To verify this conjecture, we use relative entropy to filter the k-mers of
<italic>k</italic>
 = 5 and
<italic>k</italic>
= 6. By sorting the information gains and selecting the k-mers that contribute the top 98%, the k-mers carrying more difference information are retained and those carrying less are filtered out. This method also effectively reduces the dimensions of the k-mer frequency matrix.</p>
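The exact screening formula is not spelled out in the text; the following is one plausible sketch in which each k-mer's contribution to a symmetrised relative entropy between the two classes' mean frequency distributions is computed, and the k-mers covering the top 98% of the total are kept. The function name and the toy distributions are illustrative only.

```python
import numpy as np

def kl_screen(p_lnc, p_mrna, keep=0.98, eps=1e-12):
    """Per-k-mer relative-entropy contribution between the mean frequency
    distributions of the two classes; keep the k-mers that together
    account for the top `keep` fraction of the total divergence."""
    p = np.asarray(p_lnc) + eps
    q = np.asarray(p_mrna) + eps
    contrib = p * np.log(p / q) + q * np.log(q / p)  # symmetrised KL terms
    order = np.argsort(contrib)[::-1]                # most informative first
    cum = np.cumsum(contrib[order]) / contrib.sum()
    n_keep = int(np.searchsorted(cum, keep)) + 1
    return order[:n_keep]  # indices of retained k-mers

# Toy mean-frequency distributions over 8 hypothetical k-mers.
p = np.array([0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05])
q = np.array([0.05, 0.05, 0.05, 0.10, 0.10, 0.15, 0.20, 0.30])
kept = kl_screen(p, q)
print(len(kept), "of", len(p), "k-mers retained")
```

K-mers with identical frequencies in both classes contribute nothing to the divergence and are the first to be dropped, which is what shrinks the 1024 5-mers to a much smaller retained set.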
<p id="Par36">As shown in Table 
<xref rid="Tab4" ref-type="table">4</xref>
, the k-mers of
<italic>k</italic>
= 5 are reduced from the original 1024 to 115 after screening with relative entropy. The 115 k-mers are arranged into a 5 × 23 matrix. After screening with relative entropy, the 5-mers improve the model accuracy of
<italic>k</italic>
= 5 from 0.7565 to 0.7820. Similarly, the 6-mers screened by relative entropy are reduced from the original 4096 to 1045. Although the accuracy of the 6-mer model improves slightly, from 0.7748 to 0.7790, this improvement is almost negligible. Since the number of k-mers is as high as 4096 for
<italic>k</italic>
= 6, the difference information of the sequence becomes very fragmented. Although the relative entropy screening retains 98% of the difference information, the remaining part is discarded, which may explain why the accuracy of the model does not increase significantly.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>K-mers calculation results after KL screening</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>
<italic>k</italic>
value</th>
<th>number of k-mers</th>
<th>number of k-mers after KL screening</th>
<th>original model accuracy</th>
<th>model accuracy after KL screening</th>
<th>calculation time of the original model (s/epoch)</th>
<th>calculation time of KL screening model (s/epoch)</th>
</tr>
</thead>
<tbody>
<tr>
<td>5</td>
<td>1024</td>
<td>115</td>
<td>0.7565</td>
<td>0.782</td>
<td>95 s</td>
<td>4 s</td>
</tr>
<tr>
<td>6</td>
<td>4096</td>
<td>1045</td>
<td>0.7748</td>
<td>0.779</td>
<td>855 s</td>
<td>47 s</td>
</tr>
<tr>
<td>4 + 5</td>
<td>1280</td>
<td>112</td>
<td>0.7532</td>
<td>0.629</td>
<td>290 s</td>
<td>4 s</td>
</tr>
<tr>
<td>2 + 3 + 5</td>
<td>1104</td>
<td>195</td>
<td>0.9798</td>
<td>0.9761</td>
<td>217 s</td>
<td>27 s</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par37">To compare the classification accuracy of the k-mer combination, we combine the 5-mers after the relative entropy screening with the k-mers of
<italic>k</italic>
= 4. The accuracy of this combination is only 0.6290; the accuracy of the model is reduced rather than improved. In addition, we combine the 5-mers after the relative entropy screening with the k-mers of
<italic>k</italic>
 = 2 and
<italic>k</italic>
= 3, and we find that the accuracy of the model reaches 0.9761, but this is still not the k-mer combination that provides the best classification.</p>
<p id="Par38">Although the accuracy of the model is not very obvious after the relative entropy screening, according to the information in Table
<xref rid="Tab4" ref-type="table">4</xref>
, since the k-mers are screened by relative entropy screening, the number of k-mers is reduced. The computation time of the model after relative entropy screening is greatly reduced. In particular, the combination of
<italic>k</italic>
 = 4 and
<italic>k</italic>
 = 5 reduces the calculation time from 290 to 4 s, a reduction of more than 70 times.</p>
</sec>
<sec id="Sec7">
<title>Comparison of the model accuracy with four machine learning methods</title>
<p id="Par39">Based on previous analysis, when k-mers of
<italic>k</italic>
 = 1,
<italic>k</italic>
 = 2 and
<italic>k</italic>
= 3 are combined as input to the convolutional neural network, the accuracy of the classification model is maximized. Under 10-fold cross-validation, the training set loss of the convolutional neural network model averages 0.043, and the average classification accuracy is 0.9872; the average validation set loss is 0.0431, and the average validation accuracy is 0.9790. To assess the superiority of the convolutional neural network model in classifying lncRNA and mRNA sequences, we also classify them with four machine learning algorithms: random forest, logistic regression, decision tree, and support vector machine.</p>
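For illustration, the four baselines can be set up with scikit-learn as sketched below; the synthetic data stand in for the real 84-dimensional k-mer frequency features, so the scores it prints are not the paper's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the 84-dimensional k-mer frequency features.
X = rng.random((400, 84))
y = (X[:, :42].sum(axis=1) > X[:, 42:].sum(axis=1)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

baselines = {
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in baselines.items()}
print(scores)
```

In the actual experiments, each baseline is trained and validated on exactly the same k-mer frequency data as the CNN so that the accuracies in Table 5 are directly comparable.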
<p id="Par40">For these four machine learning algorithms, we use the same training data set and verification data set as the convolutional neural network model to train and verify the model separately, and we compare the results with the convolutional neural network model. The results are shown in Table 
<xref rid="Tab5" ref-type="table">5</xref>
.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Five model effect comparison table in human</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>model</th>
<th>model accuracy</th>
<th>precision rate (P)</th>
<th>recall rate (R)</th>
<th>
<italic>F</italic>
<sub>1</sub>
score</th>
</tr>
</thead>
<tbody>
<tr>
<td>CNN</td>
<td>0.9872</td>
<td>0.9993</td>
<td>0.9955</td>
<td>0.9974</td>
</tr>
<tr>
<td>RF</td>
<td>0.8820</td>
<td>0.8949</td>
<td>0.8867</td>
<td>0.8925</td>
</tr>
<tr>
<td>LR</td>
<td>0.7020</td>
<td>0.7247</td>
<td>0.7183</td>
<td>0.7218</td>
</tr>
<tr>
<td>DT</td>
<td>0.8030</td>
<td>0.7873</td>
<td>0.7852</td>
<td>0.7869</td>
</tr>
<tr>
<td>SVM</td>
<td>0.7020</td>
<td>0.7245</td>
<td>0.7158</td>
<td>0.7179</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par41">From Table
<xref rid="Tab5" ref-type="table">5</xref>
, in terms of model accuracy, the convolutional neural network achieves 0.9872, far superior to the other algorithms. The random forest follows with a classification accuracy of 0.8820, and the decision tree achieves 0.8030. The logistic regression and support vector machine have the same classification accuracy of only 0.7020.</p>
<p id="Par42">The precision rate (P), recall rate (R) and F1 score are also shown in Table
<xref rid="Tab5" ref-type="table">5</xref>
; for all of these metrics, the CNN is superior to RF, LR, DT, and SVM. The receiver operating characteristic (ROC) curves of CNN, RF, LR, DT and SVM are shown in Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
, and the corresponding AUC (area under the curve) values are 1, 0.9689, 0.7807, 0.8009 and 0.7848, respectively, which also indicates that the CNN is better than the other methods.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>ROC curve of CNN, RF, LR, DT and SVM</p>
</caption>
<graphic xlink:href="12859_2019_3039_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p id="Par43">We also use mouse and chicken data to compare the superiority of the convolutional neural network model (combined k-mers of
<italic>k</italic>
 = 1,
<italic>k</italic>
 = 2 and
<italic>k</italic>
 = 3) in the classification of lncRNA sequences and mRNA sequences, and the results are shown in Table 
<xref rid="Tab6" ref-type="table">6</xref>
(mouse) and Table 
<xref rid="Tab7" ref-type="table">7</xref>
(chicken). The CNN again has the highest model accuracy, namely, 0.8797 in mouse and 0.9963 in chicken.
<table-wrap id="Tab6">
<label>Table 6</label>
<caption>
<p>Five model effect comparison table in mouse</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>model</th>
<th>model accuracy</th>
<th>precision rate (P)</th>
<th>recall rate (R)</th>
<th>
<italic>F</italic>
<sub>1</sub>
score</th>
</tr>
</thead>
<tbody>
<tr>
<td>CNN</td>
<td>0.8797</td>
<td>0.8960</td>
<td>0.8590</td>
<td>0.8771</td>
</tr>
<tr>
<td>RF</td>
<td>0.8120</td>
<td>0.8132</td>
<td>0.8130</td>
<td>0.8131</td>
</tr>
<tr>
<td>LR</td>
<td>0.7541</td>
<td>0.7454</td>
<td>0.7700</td>
<td>0.7575</td>
</tr>
<tr>
<td>DT</td>
<td>0.7001</td>
<td>0.6991</td>
<td>0.6977</td>
<td>0.6984</td>
</tr>
<tr>
<td>SVM</td>
<td>0.7528</td>
<td>0.7564</td>
<td>0.7476</td>
<td>0.7520</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab7">
<label>Table 7</label>
<caption>
<p>Five model effect comparison table in chicken</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>model</th>
<th>model accuracy</th>
<th>precision rate (P)</th>
<th>recall rate (R)</th>
<th>
<italic>F</italic>
<sub>1</sub>
score</th>
</tr>
</thead>
<tbody>
<tr>
<td>CNN</td>
<td>0.9963</td>
<td>0.9943</td>
<td>0.9984</td>
<td>0.9963</td>
</tr>
<tr>
<td>RF</td>
<td>0.9302</td>
<td>0.9351</td>
<td>0.9245</td>
<td>0.9298</td>
</tr>
<tr>
<td>LR</td>
<td>0.8743</td>
<td>0.8902</td>
<td>0.8546</td>
<td>0.8720</td>
</tr>
<tr>
<td>DT</td>
<td>0.8227</td>
<td>0.8148</td>
<td>0.8315</td>
<td>0.8230</td>
</tr>
<tr>
<td>SVM</td>
<td>0.8724</td>
<td>0.8881</td>
<td>0.8538</td>
<td>0.8706</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec8">
<title>Verification of the classification model in single lncRNA sequence recognition</title>
<p id="Par44">Our results are tested using 2000 mRNA sequences and 2000 lncRNA sequences selected from gencode.v 26 data. In addition, to verify whether the proposed classification model is suitable for the identification of a single sequence, we download a human lncRNA sequence in the NCBI database, which was discovered by Professor Gasri-Plotnitsky, an Israeli professor at the University of Barllan’s Institute of Life Sciences, and his team, and published in Oncotarget magazine in 2017 [
<xref ref-type="bibr" rid="CR25">25</xref>
]. This lncRNA is called GASL1. Their research indicates that GASL1 expression inhibits cell cycle progression, identifying it as a novel lncRNA modulator of cell cycle progression and cell proliferation with a potential role in cancer. Moreover, liver cancer patients with low GASL1 expression may have a worse survival rate.</p>
<p id="Par45">Taking the GASL1 sequence as an example, we verify whether the classification model of the lncRNA and mRNA sequences proposed in this paper can correctly classify the sequences. The frequencies of 1-mers, 2-mers and 3-mers in the sequence are first calculated, and the three k-mers are combined together for a total of 84 k-mers. Finally, the frequencies of the 84 k-mers are constructed into a 7 × 12 matrix and convoluted. In the model constructed in this paper, the classification label of lncRNA is 0, and the classification label of mRNA is 1. To predict the category to which the sequence belongs, the output model is finally required to identify the category label of the category to which the sequence belongs.</p>
<p id="Par46">The final prediction result is “pre_label is 0”, that is, the model recognizes the sequence as the lncRNA sequence, indicating that the model is correct for recognition of the sequence.</p>
</sec>
</sec>
<sec id="Sec9">
<title>Conclusions</title>
<p id="Par47">The main purpose of this paper is to construct a model that can effectively classify lncRNA and mRNA. First, based on the statistical analysis of the sample sequence length and k-mer frequency distribution, the lncRNA and mRNA sequences in the model training set are determined to range from 250 nt to 3500 nt and from 200 nt to 4000 nt, respectively, and a k-mer frequency matrix is constructed. Then, using the k-mer frequency matrix as input in the convolutional neural network, a classification model of lncRNAs and mRNAs is established and programmed using Python. By calculating the classification accuracy of the frequency matrix of different k-mer combinations, the classification accuracy of the model with 1-mers, 2-mers and 3-mers is highest with an accuracy of 0.9872. Comparing the established lncRNA and mRNA classification models with random forest, logistic regression, decision tree and support vector machine analyses using the ROC curve, the model classification effect is improved. Application of the model is then examined: the correct classification result is obtained by identifying the known lncRNA sequence GASL1.</p>
<p id="Par48">There remain many limitations to our research. For example, in the statistical analysis of the k-mers of lncRNA and mRNA sequences, only simple frequency analysis was used, and no in-depth statistical analysis was performed. In addition, when applying extensions to sequences of different species, the k-mer information difference between different species was not analysed in depth, but the preliminary discussion is based on the calculation results. In the future, we will conduct a systematic analysis of k-mer information differences between different species. In addition, as pointed out in [
<xref ref-type="bibr" rid="CR26">26</xref>
], user-friendly and publicly accessible web-servers represent the future direction for developing practically useful prediction methods and computational tools. Indeed, many practically useful web-servers have significantly increased the impact of bioinformatics on medical science [
<xref ref-type="bibr" rid="CR27">27</xref>
], driving medicinal chemistry into an unprecedented revolution [
<xref ref-type="bibr" rid="CR28">28</xref>
]. We shall therefore make efforts in our future work to provide a web-server for the prediction method presented in this paper.</p>
</sec>
<sec id="Sec10">
<title>Methods</title>
<sec id="Sec11">
<title>Statistical analysis of k-mers of lncRNA and mRNA sequences</title>
<p id="Par49">The human lncRNAs and mRNAs data were downloaded from the GENCODE database (Gencode.v26). The mouse lncRNAs and mRNAs data were downloaded from the GENCODE database (Genecode. VM21). The chicken lncRNAs and mRNAs data were downloaded from the Ensembl website database (5.0). Human data were used to build a convolutional neural network model. Mouse and chicken data were used to compare the superiority of the convolutional neural network model using the built model. The computed information used in this paper were as follows: (1) Operating system, Windows 10, with an InterCore I3–2365 M processor, memory size, 6G; (2) Python 3.5 to run the CNN code.</p>
<p id="Par50">Studies have shown that the k-mer frequency information in lncRNA and mRNA sequences can reveal the distribution of various subsequences in biological sequences, to measure the similarities and variances of sequences [
<xref ref-type="bibr" rid="CR29">29</xref>
].</p>
<p id="Par51">A k-mer refers to all possible subsequences of length k in a DNA sequence, RNA sequence or amino acid sequence. Figure 
<xref rid="Fig3" ref-type="fig">3</xref>
shows the process of extracting k-mers from a sequence in sliding-window mode when k is three; there are 21 3-mers, namely, GCC, CCA, CAA, AAC, ACG, CGC, GCC, CCA, CAG, AGG, GGC, GCC, CCG, CGA, GAC, ACC, CCA, CAG, AGT, GTT, and TTC. Among them, GCC and CCA each appear three times, CAG appears twice, and the others each appear once. Similarly, we can count the k-mer frequency information of lncRNA and mRNA sequences.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>The 3-mer sliding window, illustrating the process of extracting k-mers from a sequence in sliding-window mode when k is three; there are 21 3-mers</p>
</caption>
<graphic xlink:href="12859_2019_3039_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
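The sliding-window extraction shown in Fig. 3 can be reproduced directly; the 23-base sequence below is reconstructed from the list of 21 3-mers given above.

```python
from collections import Counter

def extract_kmers(seq, k):
    """Slide a window of width k along seq, one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Sequence reconstructed from the 21 3-mers listed above.
seq = "GCCAACGCCAGGCCGACCAGTTC"
kmers = extract_kmers(seq, 3)
counts = Counter(kmers)
print(len(kmers))                                   # 21 = m - k + 1 = 23 - 3 + 1
print(counts["GCC"], counts["CCA"], counts["CAG"])  # 3 3 2
```

The window count matches the m − k + 1 formula discussed next, and the multiplicities of GCC, CCA and CAG agree with the example in the text.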
<p id="Par52">For a sequence, if the sequence length is m, then the number of k-mer subsequence of length k has m-k + 1. The sequence generally consists of four bases, A, T, C, and G, and thus k-mers of length k have 4
<sup>
<italic>k</italic>
</sup>
possible structures.</p>
<p id="Par53">We randomly selected 5000 lncRNA sequences and 5000 mRNA sequences from the human data and counted their k-mer frequency information for k = 1 and k = 3. As shown in Figs.
<xref rid="Fig4" ref-type="fig">4</xref>
and
<xref rid="Fig5" ref-type="fig">5</xref>
, when k = 1, the higher the content of the G and C bases, the greater the thermal stability of the DNA molecule. When k = 2, dinucleotide preferences can be analysed; the frequency statistics of dinucleotides may represent certain characteristics of different species in different environments. For example, CG may indicate a methyl-CpG island, and TA may be part of the TATA box. When k = 3, the coding and non-coding regions in a sequence can be distinguished by counting the codon usage preferences formed by three bases. Therefore, we analysed the k-mer frequencies of the lncRNA and mRNA sequence samples separately to examine their differences.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>The 1-mer frequency distribution histogram. When 5000 lncRNA sequences and 5000 mRNA sequences are randomly selected, the contents of the A, C, G, and T bases in the lncRNA sequences are approximately 254 nt, 217 nt, 216 nt, and 240 nt, respectively, while the mRNA sequences have A, C, G, and T base contents of approximately 364 nt, 420 nt, 422 nt, and 343 nt, respectively</p>
</caption>
<graphic xlink:href="12859_2019_3039_Fig4_HTML" id="MO4"></graphic>
</fig>
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>The 3-mer distribution frequency diagram of mRNA and lncRNA.
<bold>a</bold>
The 32 3-mer distribution frequency diagram beginning with T and A, and (
<bold>b</bold>
) the other 32 3-mer distribution frequency diagram beginning with G and C</p>
</caption>
<graphic xlink:href="12859_2019_3039_Fig5_HTML" id="MO5"></graphic>
</fig>
</p>
<p id="Par54">From Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
, the average contents of the A, C, G, and T bases in the lncRNA sequence were approximately 254 nt, 217 nt, 216 nt, and 240 nt, respectively, while the mRNA sequence had A, C, G, and T base contents of approximately 364 nt, 420 nt, 422 nt, and 343 nt, respectively. Therefore, the contents of all four 1-mers in the mRNA were higher than in the lncRNA. Furthermore, in both the lncRNA and mRNA sequences, the contents of the C and G bases were very similar, as were the contents of the A and T bases. However, in the lncRNA sequence, the contents of the C and G bases were lower than those of the A and T bases, whereas in the mRNA sequence the opposite was true.</p>
<p id="Par55">When k is taken as three, the 3-mer fragments appearing in the sequence are {AAA, AAT, AAC, AAG, ATA, ATT, ATC, ATG,..., GGA, GGT, GGC, GGG}. Similarly, the frequency information for each 3-mer segment of each sequence in the sample sequence is counted in turn, and the respective mean values are calculated to estimate the frequency of occurrence of each 3-mer in the lncRNA and mRNA sequences. The histogram of the 3-mer frequency distribution in the mRNA and lncRNA is plotted, as shown in Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
a and b. As seen in Fig.
<xref rid="Fig5" ref-type="fig">5</xref>
a and b, the frequency distribution of most 3-mer subsequences in the mRNA sequence fluctuated sharply, but the frequencies of the 3-mers GCG, CGG, CGC, CGT, CGA, TCG, and ACG were small, at only approximately 4 in each sequence. Moreover, the TGG, CAG, CTG, CCA, CCT, GCC, and GGC segments were enriched in the mRNA sequence, all with frequencies of approximately 40. In the lncRNA sequence, the frequency distribution of each 3-mer subsequence was relatively stable; most frequencies were distributed around 15 with little fluctuation. A few 3-mers, such as AAG, AGA, AGG, TGG, CAG, CTG, CCA, CCT, GAA, and GGA, had higher frequencies, exceeding 20. Moreover, only four 3-mers, ACG, TCG, CGA, and CGT, had frequencies lower than 5.</p>
<p id="Par56">By analysing the frequency distributions of 1-mers and 3-mers of lncRNA and mRNA sequences, the k-mer distributions of the two were found to have their own preferences. Therefore, we could use the k-mer frequency distribution information for the sequence as the difference information between lncRNA and mRNA sequences.</p>
</sec>
<sec id="Sec12">
<title>The lncRNA and mRNA classification model based on the convolutional neural network</title>
<p id="Par57">In this paper, a convolutional neural network algorithm was used to construct a model suitable for classifying gene sequences based on the transformation of sequences into the k-mer frequency matrix. The model framework is shown in Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
.
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>The lncRNA recognition model calculation flow chart. The lncRNA and mRNA classification model includes the input part and convolutional neural network part</p>
</caption>
<graphic xlink:href="12859_2019_3039_Fig6_HTML" id="MO6"></graphic>
</fig>
</p>
<p id="Par58">As shown in Fig.
<xref rid="Fig6" ref-type="fig">6</xref>
, the lncRNA and mRNA classification model includes the input part and the convolutional neural network part. The input section covers extracting the k-mers from the sequence and constructing them into a k-mer frequency matrix. The convolutional neural network consists of two types of layers. The first is the k-mer feature extraction layer, in which the input of each neuron is connected to the local receptive field of the previous layer and the local features are extracted. Once a local feature is extracted, its positional relationship with the other k-mer features is also determined. The second is the k-mer feature mapping layer. Each computing layer of the network consists of multiple feature maps. Each feature map is a plane, and the weights of all neurons on the plane are equal. The k-mer feature mapping structure uses the sigmoid function as the activation function of the convolutional network so that the feature map has displacement invariance. In addition, since all neurons on a mapped surface share weights, the number of network parameters is reduced. Each convolutional layer in the convolutional neural network is followed by a computational layer for local averaging and secondary extraction of k-mer features. This two-stage feature extraction structure reduces the feature resolution.</p>
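The two-stage structure can be sketched in NumPy (an illustrative toy, not the paper's actual network): one shared 3 × 3 kernel swept over an 8 × 8 input with a sigmoid activation, followed by 2 × 2 local averaging that reduces the feature resolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool(fmap, size=2):
    """Local averaging (subsampling) layer: the secondary extraction step."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).mean(axis=(1, 3))

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))  # one shared kernel: all neurons on the map share weights
X = rng.random((8, 8))                # stands in for a k-mer frequency matrix

# feature extraction: each output neuron sees only a 3x3 local receptive field
fmap = np.empty((6, 6))
for i in range(6):
    for j in range(6):
        fmap[i, j] = sigmoid(np.sum(X[i:i + 3, j:j + 3] * kernel))

pooled = avg_pool(fmap)               # 6x6 feature map -> 3x3, reduced resolution
```

Because the single kernel is reused at every position, this layer has only nine weights regardless of the input size, illustrating how weight sharing reduces the number of network parameters.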
</sec>
<sec id="Sec13">
<title>Construction of the k-mer frequency matrix</title>
<p id="Par59">The k-mer frequency of each sequence is first normalized and converted to a frequency. Then, according to the application of the convolutional neural network in image recognition, the k-mer frequency of each sequence is constructed into a matrix form of the same size as the input of the model. Finally, the convolutional neural network is used to autonomously learn the difference between the two sequences of k-mer frequency information to achieve the purpose of classifying and identifying lncRNAs and mRNAs. The specific process is as follows:
<list list-type="bullet">
<list-item>
<p id="Par60">Step 1: The k-mer frequency information for each sequence is counted. In this paper, the sequence is traversed in the order of A, T, C, and G, and finally the frequency of each 4-mer is counted;</p>
</list-item>
<list-item>
<p id="Par61">Step 2: Normalize the count of each k-mer within each sequence. That is, the frequency
<italic>p</italic>
<sub>
<italic>i</italic>
</sub>
(
<italic>i</italic>
 = 1, 2, …, 
<italic>n</italic>
, 
<italic>n</italic>
 = 4
<sup>
<italic>k</italic>
</sup>
) of all k-mers in each sequence is obtained, and the sum of the frequencies of k-mers in each sequence is 1;</p>
</list-item>
<list-item>
<p id="Par62">Step 3: The frequencies of all k-mers in each sequence are constructed into a matrix form
<italic>A</italic>
 × 
<italic>B</italic>
(
<italic>A</italic>
 × 
<italic>B</italic>
 = 4
<sup>
<italic>k</italic>
</sup>
), and the elements in the matrix are arranged horizontally in the order of the k-mer. For example, when k is 4, the constructed matrix is 16 × 16.</p>
</list-item>
</list>
</p>
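The three steps above can be sketched in Python (a minimal illustration assuming the A/T/C/G traversal order stated in step 1; <italic>kmer_freq_matrix</italic> is our name for the helper, not the paper's code):

```python
import itertools
from collections import Counter

import numpy as np

def kmer_freq_matrix(seq, k=4):
    """Steps 1-3: count k-mers in A,T,C,G traversal order, normalize the
    counts so the frequencies sum to 1, and arrange them row by row into
    a square matrix (16 x 16 when k = 4, since 4**4 = 256 = 16 * 16)."""
    alphabet = "ATCG"  # traversal order used in the paper
    kmers = ["".join(p) for p in itertools.product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1
    freqs = np.array([counts[m] / total for m in kmers])
    side = int(np.sqrt(4 ** k))  # 4**k = side * side
    return freqs.reshape(side, side)

M = kmer_freq_matrix("GCCAACGCCAGGCCGACCAGTTC" * 10, k=4)
```

The resulting matrix plays the role of the model input, analogous to a grayscale image in image recognition.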
</sec>
<sec id="Sec14">
<title>K-mer screening based on relative entropy</title>
<p id="Par63">As the value of k increases, the number of k-mer types of length k in the sequence increases exponentially. If k is large, the average frequency of each k-mer will be lower, and numerous k-mers will have a frequency of zero. To reduce the complexity of the data calculation, we used relative entropy to screen the k-mers.</p>
<p id="Par64">Let
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
be the frequency distribution of the k-mer in the lncRNA sequence and
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
be the frequency distribution of the k-mer in the mRNA sequence. The relative entropy of
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
and
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
is then
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ D\left({p}_{\ln c},{p}_m\right)={\sum}_{i=1}^{4^k}{p}_{\ln c}(i)\ln \frac{p_{\ln c}(i)}{p_m(i)},k\in \left[1,n\right],i\in \left[1,{4}^k\right], $$\end{document}</tex-math>
<mml:math id="M2" display="block">
<mml:mi>D</mml:mi>
<mml:mfenced close=")" open="(" separators=",">
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>ln</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:msubsup>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>ln</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:mi>i</mml:mi>
</mml:mfenced>
<mml:mo>ln</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>ln</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:mi>i</mml:mi>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:mi>i</mml:mi>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mfenced close="]" open="[" separators=",">
<mml:mn>1</mml:mn>
<mml:mi>n</mml:mi>
</mml:mfenced>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mfenced close="]" open="[" separators=",">
<mml:mn>1</mml:mn>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par65">If
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
 = 
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
, then
<italic>D</italic>
(
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
, 
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
) = 0, which indicates that the k-mer frequency distribution of the lncRNA sequence does not differ from the frequency distribution of the mRNA. If there is a difference in the k-mer frequency distribution between lncRNA and mRNA, then the value of
<italic>D</italic>
(
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
, 
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
) will be greater than zero. Concurrently, the smaller the value of
<italic>D</italic>
(
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
, 
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
) is, the smaller will be the difference in the k-mer frequency distribution between lncRNA and mRNA. Otherwise, the larger the value of
<italic>D</italic>
(
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
, 
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
) is, the greater will be the difference in the k-mer frequency distribution between lncRNA and mRNA. To screen out k-mers that increase the difference information, set
<disp-formula id="Equ2">
<label>2</label>
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {d}_{\lambda }={p}_{\ln c}(i)\ln \frac{p_{\ln c}(i)}{p_m(i)},\lambda \in \left[1,{4}^k\right], $$\end{document}</tex-math>
<mml:math id="M4" display="block">
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>λ</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>ln</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:mi>i</mml:mi>
</mml:mfenced>
<mml:mo>ln</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>ln</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:mi>i</mml:mi>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:mi>i</mml:mi>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mi>λ</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mfenced close="]" open="[" separators=",">
<mml:mn>1</mml:mn>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par66">Sorting
<italic>d</italic>
<sub>
<italic>λ</italic>
</sub>
in descending order obtains
<disp-formula id="Equ3">
<label>3</label>
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ R=\frac{\sum_{\lambda =1}^n{d}_{\lambda }}{D\left({p}_{\ln c},{p}_m\right)},n\in \left[1,{4}^k\right]. $$\end{document}</tex-math>
<mml:math id="M6" display="block">
<mml:mi>R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>λ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>λ</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi>D</mml:mi>
<mml:mfenced close=")" open="(" separators=",">
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo>ln</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mfenced>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>∈</mml:mo>
<mml:mfenced close="]" open="[" separators=",">
<mml:mn>1</mml:mn>
<mml:msup>
<mml:mn>4</mml:mn>
<mml:mi>k</mml:mi>
</mml:msup>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par67">
<italic>R</italic>
reflects the proportion of the total k-mer difference information that is extracted. The
<italic>λ</italic>
corresponding to
<italic>R</italic>
from 1 to 4
<sup>
<italic>k</italic>
</sup>
is calculated sequentially. If the threshold is set to
<italic>R</italic>
 ≥ 98%, then the corresponding value is
<italic>λ</italic>
 = 
<italic>ϖ</italic>
, and the first
<italic>ϖ</italic>
k-mers are the filtered k-mers.</p>
<p id="Par68">Specific steps are as follows:
<list list-type="bullet">
<list-item>
<p id="Par69">Step 1: The occurrences of each k-mer are counted separately in the lncRNA sample sequences and the mRNA sample sequences, and each count is then converted to a frequency, yielding the frequency values of the 4
<sup>
<italic>k</italic>
</sup>
kinds of k-mers in the sample are obtained. For example, when
<italic>k</italic>
 = 4, the occurrences of the 256 possible 4-mers in the lncRNA and mRNA sample sequences are counted, respectively, and each count is converted to a frequency value so that the frequency values of the 256 4-mers sum to 1 within each of the two samples. The results of this step are
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
and
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
in the formula (
<xref rid="Equ1" ref-type="">1</xref>
).</p>
</list-item>
<list-item>
<p id="Par70">Step 2: According to the frequency value of each k-mer in the two sequences obtained in step 1, the relative entropy, that is, the value of
<italic>D</italic>
(
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
, 
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
), is calculated according to the formula (
<xref rid="Equ1" ref-type="">1</xref>
). Then,
<italic>R</italic>
is calculated according to the value of
<italic>D</italic>
(
<italic>p</italic>
<sub>ln
<italic>c</italic>
</sub>
, 
<italic>p</italic>
<sub>
<italic>m</italic>
</sub>
) in formula (
<xref rid="Equ3" ref-type="">3</xref>
), and finally,
<italic>λ</italic>
is taken as the value of
<italic>ϖ</italic>
when
<italic>R</italic>
 ≥ 98%. Then, according to the descending order of
<italic>d</italic>
<sub>
<italic>λ</italic>
</sub>
, the first
<italic>ϖ</italic>
k-mers are the filtered k-mers.</p>
</list-item>
<list-item>
<p id="Par71">Step 3: The frequencies of the k-mers screened in step 2 are counted separately for the lncRNA and mRNA sequences, and these frequencies are then constructed into a matrix following the data input processing steps described above.</p>
</list-item>
</list>
</p>
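The screening procedure can be sketched as follows (a minimal illustration of Eqs. (1)–(3), assuming strictly positive frequency vectors; <italic>screen_kmers</italic> and the toy 4-element distributions are ours, not the paper's):

```python
import numpy as np

def screen_kmers(p_lnc, p_m, target=0.98):
    """Rank k-mers by their contribution d_lambda to the relative entropy
    D(p_lnc, p_m) and keep the leading ones until the cumulative ratio R
    first reaches the target (here 98%)."""
    p_lnc = np.asarray(p_lnc, dtype=float)
    p_m = np.asarray(p_m, dtype=float)
    d = p_lnc * np.log(p_lnc / p_m)  # per-k-mer terms d_lambda (Eq. 2)
    D = d.sum()                      # relative entropy D(p_lnc, p_m) (Eq. 1)
    order = np.argsort(d)[::-1]      # d_lambda in descending order
    R = np.cumsum(d[order]) / D      # cumulative ratio R (Eq. 3)
    reached = np.nonzero(R >= target)[0]
    n_keep = int(reached[0]) + 1 if reached.size else len(d)
    return order[:n_keep]            # indices of the filtered k-mers

p_lnc = np.array([0.4, 0.3, 0.2, 0.1])  # toy frequency distributions
p_m = np.array([0.1, 0.2, 0.3, 0.4])
kept = screen_kmers(p_lnc, p_m)
```

Note that individual terms d_lambda can be negative when a k-mer is rarer in lncRNA than in mRNA, which is why the cumulative ratio is checked against the threshold rather than assumed to grow monotonically.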
</sec>
<sec id="Sec15">
<title>Convolution calculation of the k-mer frequency matrix</title>
<p id="Par72">In this paper, the convolution calculation is used to strengthen the important features in the k-mer frequency matrix and weaken the influence of irrelevant k-mer features.</p>
<p id="Par73">Taking
<italic>k</italic>
 = 3 as an example, 64 kinds of 3-mers can be extracted from an lncRNA sequence. According to the k-mer frequencies, the lncRNA sequence can be constructed into an 8 × 8 k-mer frequency matrix
<italic>M</italic>
,
<disp-formula id="Equ4">
<label>4</label>
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ M=\left[\begin{array}{cccccccc}0.0059& 0.0102& 0.0117& 0.0190& 0.0059& 0.0029& 0.0059& 0.0102\\ {}0.0088& 0.0073& 0.0220& 0.0059& 0.0220& 0.0146& 0.0234& 0.0146\\ {}0.0044& 0.0044& 0.0088& 0.0088& 0.0073& 0.0102& 0.0176& 0.0161\\ {}0.0132& 0.0176& 0.0322& 0.0117& 0.0117& 0.0220& 0.0249& 0.0146\\ {}0.0161& 0.0044& 0.0102& 0.0337& 0.0088& 0.0264& 0.0293& 0.0264\\ {}0.0278& 0.0439& 0.0366& 0.0190& 0.0044& 0.0132& 0.0190& 0.0102\\ {}0.0205& 0.0073& 0.0117& 0.0176& 0.0044& 0.0117& 0.0220& 0.0190\\ {}0.0161& 0.0220& 0.0366& 0.0102& 0.0146& 0.0073& 0.0176& 0.0161\end{array}\right]. $$\end{document}</tex-math>
<mml:math id="M8" display="block">
<mml:mi>M</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mtable columnalign="center">
<mml:mtr>
<mml:mtd>
<mml:mn>0.0059</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0059</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0029</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0059</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0088</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0073</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0059</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0234</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0088</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0088</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0073</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0132</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0322</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0249</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0337</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0088</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0264</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0293</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0264</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0278</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0439</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0366</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0132</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0205</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0073</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0366</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0073</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ4.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par74">The convolution calculation is performed by taking Eq. (
<xref rid="Equ4" ref-type="">4</xref>
) as input and randomly initializing a 3 × 3 convolution kernel,
<disp-formula id="Equ5">
<label>5</label>
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \mathrm{Kernel}=\left[\begin{array}{ccc}1& 0& 1\\ {}0& 1& 0\\ {}1& 0& 0\end{array}\right]. $$\end{document}</tex-math>
<mml:math id="M10" display="block">
<mml:mtext>Kernel</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mtable columnalign="center">
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ5.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par75">The convolution calculation is essentially a process of weighted summation. The calculation usually takes the form
<disp-formula id="Equ6">
<label>6</label>
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {X}_j^l=f\left(\sum \limits_{i={M}_j}{X}_i^{l-1}\times {Kernel}_{ij}^l+{B}^l\right), $$\end{document}</tex-math>
<mml:math id="M12" display="block">
<mml:msubsup>
<mml:mi>X</mml:mi>
<mml:mi>j</mml:mi>
<mml:mi>l</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mi>f</mml:mi>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:munder>
<mml:mo movablelimits="false">∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>M</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:msubsup>
<mml:mi>X</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>×</mml:mo>
<mml:msubsup>
<mml:mtext mathvariant="italic">Kernel</mml:mtext>
<mml:mi mathvariant="italic">ij</mml:mi>
<mml:mi>l</mml:mi>
</mml:msubsup>
<mml:mo>+</mml:mo>
<mml:msup>
<mml:mi>B</mml:mi>
<mml:mi>l</mml:mi>
</mml:msup>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ6.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<italic>f</italic>
is the activation function of the neurons in the layer, and
<italic>l</italic>
indicates the layer index in the network. Kernel is the convolution kernel,
<italic>M</italic>
<sub>
<italic>j</italic>
</sub>
is a local area of the input object, and
<italic>B</italic>
represents the offset of each layer.</p>
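As a concrete check of Eq. (6), the weighted summation over the top-left 3 × 3 block of the matrix in Eq. (4) with the kernel of Eq. (5) gives 0.0059 + 0.0117 + 0.0073 + 0.0044 = 0.0293, the first entry of the feature map in Eq. (7). A minimal sketch, with f taken as the identity and the bias set to 0 (<italic>conv2d_valid</italic> is an illustrative helper, not the paper's code):

```python
import numpy as np

def conv2d_valid(X, kernel, b=0.0):
    """Valid 2-D convolution as a weighted sum: Eq. (6) with f = identity
    and the bias B set to 0 by default."""
    kh, kw = kernel.shape
    out = np.empty((X.shape[0] - kh + 1, X.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * kernel) + b
    return out

kernel = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
# top-left 3x3 block of the k-mer frequency matrix M in Eq. (4)
block = np.array([[0.0059, 0.0102, 0.0117],
                  [0.0088, 0.0073, 0.0220],
                  [0.0044, 0.0044, 0.0088]])
val = conv2d_valid(block, kernel)[0, 0]
```

Applied to the full 8 × 8 matrix, the same sliding computation yields the 6 × 6 feature map, since 8 − 3 + 1 = 6 in each dimension.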
<p id="Par76">In Eq. (
<xref rid="Equ5" ref-type="">5</xref>
), the convolution kernel has nine parameters. The k-mer frequency matrix in Eq. (
<xref rid="Equ4" ref-type="">4</xref>
) and the convolution kernel are convoluted by the calculation method of Eq. (
<xref rid="Equ6" ref-type="">6</xref>
).
<italic>B</italic>
<sup>
<italic>l</italic>
</sup>
is set to 0. The specific calculation process is shown in Eq. (
<xref rid="Equ7" ref-type="">7</xref>
).
<disp-formula id="Equ7">
<label>7</label>
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\displaystyle \begin{array}{c}\mathrm{feature}\ \mathrm{map}=\left[\begin{array}{cccccccc}{0.0059}_{\times 1}& {0.0102}_{\times 0}& {0.0117}_{\times 1}& 0.0190& 0.0059& 0.0029& 0.0059& 0.0102\\ {}{0.0088}_{\times 0}& {0.0073}_{\times 1}& {0.0220}_{\times 0}& 0.0059& 0.0220& 0.0146& 0.0234& 0.0146\\ {}{0.0044}_{\times 1}& {0.0044}_{\times 0}& {0.0088}_{\times 0}& 0.0088& 0.0073& 0.0102& 0.0176& 0.0161\\ {}0.0132& 0.0176& 0.0322& 0.0117& 0.0117& 0.0220& 0.0249& 0.0146\\ {}0.0161& 0.0044& 0.0102& 0.0337& 0.0088& 0.0264& 0.0293& 0.0264\\ {}0.0278& 0.0439& 0.0366& 0.0190& 0.0044& 0.0132& 0.0190& 0.0102\\ {}0.0205& 0.0073& 0.0117& 0.0176& 0.0044& 0.0117& 0.0220& 0.0190\\ {}0.0161& 0.0220& 0.0366& 0.0102& 0.0146& 0.0073& 0.0176& 0.0161\end{array}\right]\\ {}=\left[\begin{array}{cccccc}0.0293& 0.0556& 0.0323& 0.0527& 0.0337& 0.0467\\ {}0.0484& 0.0396& 0.0850& 0.0495& 0.0673& 0.0688\\ {}0.0469& 0.0498& 0.0480& 0.0644& 0.0657& 0.0776\\ {}0.0776& 0.0834& 0.1142& 0.0615& 0.0674& 0.0791\\ {}0.0907& 0.0820& 0.0497& 0.0821& 0.0557& 0.0835\\ {}0.0878& 0.0966& 0.0952& 0.0468& 0.0497& 0.0527\end{array}\right]\\ {}\kern2.75em \end{array}} $$\end{document}</tex-math>
<mml:math id="M14" display="block">
<mml:mtable displaystyle="true" groupalign="right left">
<mml:mtr>
<mml:mtd>
<mml:mtext>feature</mml:mtext>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mi>map</mml:mi>
<mml:maligngroup></mml:maligngroup>
<mml:mo>=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mtable columnalign="center">
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mn>0.0059</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:msub>
<mml:mn>0.0102</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:msub>
<mml:mn>0.0117</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0059</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0029</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0059</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mn>0.0088</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:msub>
<mml:mn>0.0073</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:msub>
<mml:mn>0.0220</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0059</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0234</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:msub>
<mml:mn>0.0044</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:msub>
<mml:mn>0.0044</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:msub>
<mml:mn>0.0088</mml:mn>
<mml:mrow>
<mml:mo>×</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0088</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0073</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0132</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0322</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0249</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0337</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0088</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0264</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0293</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0264</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0278</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0439</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0366</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0132</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0205</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0073</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0044</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0117</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0190</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0220</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0366</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0102</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0146</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0073</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0176</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0161</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mo>=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mtable columnalign="center">
<mml:mtr>
<mml:mtd>
<mml:mn>0.0293</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0556</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0323</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0527</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0337</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0467</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0484</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0396</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0850</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0495</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0673</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0688</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0469</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0498</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0480</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0644</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0657</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0776</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0776</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0834</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.1142</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0615</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0674</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0791</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0907</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0820</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0497</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0821</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0557</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0835</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0878</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0966</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0952</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0468</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0497</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0527</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mspace width="2.75em"></mml:mspace>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ7.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par77">The first element of the feature map in Eq. (
<xref rid="Equ7" ref-type="">7</xref>
) is the weighted sum of the first 3 × 3 block of the input matrix
<italic>M</italic>
, with weights given by the corresponding elements of the convolution kernel. Similarly, the second element is the weighted sum of the second 3 × 3 block. In this way, an output feature map of size 6 × 6 is obtained.</p>
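This sliding weighted-sum step can be sketched in NumPy; the 3 × 3 kernel below is an illustrative placeholder (the model's kernels are learned weights), but the shapes match the worked example:

```python
import numpy as np

def conv2d_valid(m, k):
    """'Valid' 2-D convolution: slide the kernel over the input with
    stride 1, taking the weighted sum of each local block."""
    kh, kw = k.shape
    out_h = m.shape[0] - kh + 1
    out_w = m.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(m[i:i + kh, j:j + kw] * k)
    return out

# An 8 x 8 input convolved with a 3 x 3 kernel gives a 6 x 6 feature map.
m = np.arange(64, dtype=float).reshape(8, 8)
k = np.ones((3, 3)) / 9.0          # illustrative averaging kernel
print(conv2d_valid(m, k).shape)    # (6, 6)
```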
<p id="Par78">The convolutional neural network of this model has two convolutional layers. The first convolutional layer uses 32 convolution kernels, and the second uses 64. Each convolution kernel is 3 × 3 in size, and the horizontal and vertical strides are both 1. The border is filled with zeros (“same” padding) so that the matrix keeps the same size after convolution as before, i.e., the k-mer frequency matrix
<italic>M</italic>
after zero padding is
<disp-formula id="Equ8">
<label>8</label>
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {O}_j=\left[\begin{array}{cccccccc}0& 0& 0& 0& 0& 0& 0& 0\\ {}0& 0.0293& 0.0556& 0.0323& 0.0527& 0.0337& 0.0467& 0\\ {}0& 0.0484& 0.0396& 0.0850& 0.0495& 0.0673& 0.0688& 0\\ {}0& 0.0469& 0.0498& 0.0480& 0.0644& 0.0657& 0.0776& 0\\ {}0& 0.0776& 0.0834& 0.1142& 0.0615& 0.0674& 0.0791& 0\\ {}0& 0.0907& 0.0820& 0.0497& 0.0821& 0.0557& 0.0835& 0\\ {}0& 0.0878& 0.0966& 0.0952& 0.0468& 0.0497& 0.0527& 0\\ {}0& 0& 0& 0& 0& 0& 0& 0\end{array}\right]. $$\end{document}</tex-math>
<mml:math id="M16" display="block">
<mml:msub>
<mml:mi>O</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mtable columnalign="center">
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0293</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0556</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0323</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0527</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0337</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0467</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0484</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0396</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0850</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0495</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0673</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0688</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0469</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0498</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0480</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0644</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0657</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0776</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0776</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0834</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.1142</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0615</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0674</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0791</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0907</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0820</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0497</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0821</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0557</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0835</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0878</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0966</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0952</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0468</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0497</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0527</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ8.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
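The zero-filling step above can be reproduced directly in NumPy; this sketch uses placeholder values rather than the actual k-mer frequencies:

```python
import numpy as np

# Zero-fill a one-cell border around a 6 x 6 matrix ("same" padding for a
# 3 x 3 kernel with stride 1), so the convolution output keeps the 6 x 6 size.
m = np.full((6, 6), 0.05)   # stand-in values for the k-mer frequency matrix
padded = np.pad(m, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)         # (8, 8)
```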
</sec>
<sec id="Sec16">
<title>Pooling calculation of the convolution kernel output matrix</title>
<p id="Par79">The pooling method adopted in this paper is max pooling, which compresses the k-mer features and reduces the amount of computation. The model has only one pooling layer: after the second convolutional layer, a pooling window is slid over the output with a stride of 2.</p>
<p id="Par80">For Eq. (
<xref rid="Equ8" ref-type="">8</xref>
), the maximum value of the first 2 × 2 block of matrix
<italic>O</italic>
<sub>
<italic>j</italic>
</sub>
is 0.0293. The final result of the pooled calculation of matrix
<italic>O</italic>
<sub>
<italic>j</italic>
</sub>
is
<disp-formula id="Equ9">
<label>9</label>
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {S}_j=\left[\begin{array}{cccc}0.0293& 0.0556& 0.0527& 0.0467\\ {}0.0484& 0.0850& 0.0673& 0.0776\\ {}0.0907& 0.1142& 0.0821& 0.0835\\ {}0.0878& 0.0966& 0.0497& 0.0527\end{array}\right]. $$\end{document}</tex-math>
<mml:math id="M18" display="block">
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfenced close="]" open="[">
<mml:mtable columnalign="center">
<mml:mtr>
<mml:mtd>
<mml:mn>0.0293</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0556</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0527</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0467</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0484</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0850</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0673</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0776</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0907</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.1142</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0821</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0835</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0.0878</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0966</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0497</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mn>0.0527</mml:mn>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ9.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
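The worked pooling step can be checked with a short NumPy sketch; only the one nonzero entry of the first 2 × 2 block is filled in here for illustration:

```python
import numpy as np

def max_pool_2x2(m):
    """Max pooling with a 2 x 2 window and stride 2: keep the largest
    value in each non-overlapping 2 x 2 block."""
    h, w = m.shape[0] // 2, m.shape[1] // 2
    return m.reshape(h, 2, w, 2).max(axis=(1, 3))

# The first 2 x 2 block of the padded matrix in Eq. (8) is
# [[0, 0], [0, 0.0293]], so its pooled value is 0.0293.
o = np.zeros((8, 8))
o[1, 1] = 0.0293
s = max_pool_2x2(o)
print(s[0, 0], s.shape)    # 0.0293 (4, 4)
```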
<p id="Par81">Since 64 matrices of size 6 × 6 were obtained after the second convolutional layer and the pooling window was 2 × 2, the number of rows and columns was halved, yielding 64 k-mer matrices of size 3 × 3.</p>
</sec>
<sec id="Sec17">
<title>Fully connected neural network based on SoftMax function</title>
<p id="Par82">To prevent over-fitting and improve the generalization ability of the model, we randomly drop a fraction of neuron connections, so that some neurons are not activated in each training step. After the pooling layer, the connections between its output neurons and the neurons of the fully connected layer are dropped with a probability of 0.25. The output matrix of the pooling layer is flattened and connected to 128 neurons in the fully connected layer. The activation function is again the ReLU function.</p>
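The dropout step can be sketched with a generic inverted-dropout implementation in NumPy (this is a standard formulation, not the authors' code; the input vector is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.25):
    """Inverted dropout: zero each connection with probability `rate`
    during training, rescaling the survivors so the expected output
    of the layer is unchanged."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

a = np.ones(1000)                  # placeholder activations
print(dropout(a).mean())           # close to 1.0: about a quarter dropped
```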
<p id="Par83">We used the SoftMax function to activate the output of the fully connected network in the model. The formula of the SoftMax function is
<disp-formula id="Equ10">
<label>10</label>
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ f\left({z}_j\right)=\frac{e^{z_j}}{\sum_{i=1}^n{e}^{z_i}}. $$\end{document}</tex-math>
<mml:math id="M20" display="block">
<mml:mi>f</mml:mi>
<mml:mfenced close=")" open="(">
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:msup>
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ10.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
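As a numeric check of Eq. (10), a SoftMax sketch in NumPy (the logits below are arbitrary illustrative values; subtracting the maximum is a standard numerical stabilisation that does not change the result):

```python
import numpy as np

def softmax(z):
    """SoftMax of Eq. (10): exponentiate each logit and normalise
    so the outputs sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# When one logit dominates, its probability approaches 1.
p = softmax(np.array([8.0, 0.5]))
print(p[0] > 0.99, p.sum())
```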
<p id="Par84">From Eq. (
<xref rid="Equ10" ref-type="">10</xref>
), if
<italic>z</italic>
<sub>
<italic>j</italic>
</sub>
is greater than the other
<italic>z</italic>
, the value of the function
<italic>f</italic>
(
<italic>z</italic>
<sub>
<italic>j</italic>
</sub>
) approaches 1, and otherwise it approaches 0. Therefore, when the value of
<italic>f</italic>
(
<italic>z</italic>
<sub>
<italic>j</italic>
</sub>
) is 1, the input sequence of this model is judged to be an lncRNA sequence, and when the value of
<italic>f</italic>
(
<italic>z</italic>
<sub>
<italic>j</italic>
</sub>
) is 0, the input sequence is judged to be an mRNA sequence.</p>
<p id="Par85">The Adadelta optimizer is used for gradient descent during training, and cross-entropy is used as the loss function.</p>
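A minimal NumPy sketch of the cross-entropy loss on one sample (the one-hot labels and predicted probabilities below are hypothetical):

```python
import numpy as np

def cross_entropy(p_pred, y_true):
    """Cross-entropy loss for one sample: -sum(y * log(p)), where y_true
    is a one-hot label (lncRNA vs. mRNA) and p_pred the SoftMax outputs."""
    eps = 1e-12                      # guard against log(0)
    return -np.sum(y_true * np.log(p_pred + eps))

# A confident correct prediction gives a small loss,
# a confident wrong one a large loss.
print(cross_entropy(np.array([0.99, 0.01]), np.array([1.0, 0.0])))
print(cross_entropy(np.array([0.01, 0.99]), np.array([1.0, 0.0])))
```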
</sec>
<sec id="Sec18">
<title>Setting of the evaluation index in the classification model</title>
<p id="Par86">The indicator for evaluating the performance of a classification model is generally the classification accuracy, also known as the model accuracy. Commonly used evaluation indicators for binary classification are precision and recall. In this paper, the lncRNA sequences were the positive class and the mRNA sequences were the negative class. Each prediction of the classifier on the test data set is either correct or incorrect, giving four cases whose counts were recorded as follows:</p>
<p id="Par87">
<italic>TP</italic>
——the number of positive-class samples predicted as positive;</p>
<p id="Par88">
<italic>FN</italic>
——the number of positive-class samples predicted as negative;</p>
<p id="Par89">
<italic>FP</italic>
——the number of negative-class samples predicted as positive;</p>
<p id="Par90">
<italic>TN</italic>
——the number of negative-class samples predicted as negative.</p>
<p id="Par91">The precision rate is defined as
<disp-formula id="Equ11">
<label>11</label>
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ P=\frac{TP}{TP+ FP}, $$\end{document}</tex-math>
<mml:math id="M22" display="block">
<mml:mi>P</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ11.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par92">The recall rate is defined as
<disp-formula id="Equ12">
<label>12</label>
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ R=\frac{TP}{TP+ FN}. $$\end{document}</tex-math>
<mml:math id="M24" display="block">
<mml:mi>R</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>.</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ12.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par93">In addition, the
<italic>F</italic>
<sub>1</sub>
score is the harmonic mean of the precision rate and the recall rate, i.e.,
<disp-formula id="Equ13">
<label>13</label>
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \frac{2}{F_1}=\frac{1}{P}+\frac{1}{R}, $$\end{document}</tex-math>
<mml:math id="M26" display="block">
<mml:mfrac>
<mml:mn>2</mml:mn>
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mfrac>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>P</mml:mi>
</mml:mfrac>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>R</mml:mi>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ13.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
<disp-formula id="Equ14">
<label>14</label>
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {F}_1=\frac{2 TP}{2 TP+ FP+ FN}, $$\end{document}</tex-math>
<mml:math id="M28" display="block">
<mml:msub>
<mml:mi>F</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">TP</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:math>
<graphic xlink:href="12859_2019_3039_Article_Equ14.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par94">If both the precision and recall rates are high, then
<italic>F</italic>
<sub>1</sub>
will be high [
<xref ref-type="bibr" rid="CR30">30</xref>
].</p>
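The three measures of Eqs. (11), (12) and (14) can be computed together from the four counts; the confusion-matrix counts below are hypothetical, purely for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean F1,
    computed from the confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return p, r, f1

# Hypothetical counts: 90 lncRNAs found, 10 missed,
# 5 mRNAs mislabelled as lncRNA.
p, r, f1 = precision_recall_f1(tp=90, fp=5, fn=10)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9474 0.9 0.9231
```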
</sec>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>ANT</term>
<def>
<p id="Par4">Adjoining Nucleotide Triplets</p>
</def>
</def-item>
<def-item>
<term>AUC</term>
<def>
<p id="Par5">Area Under Curve</p>
</def>
</def-item>
<def-item>
<term>CNCI</term>
<def>
<p id="Par6">Coding-Non Coding Index</p>
</def>
</def-item>
<def-item>
<term>CPC</term>
<def>
<p id="Par7">Coding Potential Calculator</p>
</def>
</def-item>
<def-item>
<term>DT</term>
<def>
<p id="Par8">Decision Tree</p>
</def>
</def-item>
<def-item>
<term>lncRNA</term>
<def>
<p id="Par9">Long-chain non-coding RNA</p>
</def>
</def-item>
<def-item>
<term>LR</term>
<def>
<p id="Par10">Logistic regression</p>
</def>
</def-item>
<def-item>
<term>MLCDS</term>
<def>
<p id="Par11">Most-Like Coding Domain Sequence</p>
</def>
</def-item>
<def-item>
<term>mRNA</term>
<def>
<p id="Par12">Messenger RNA</p>
</def>
</def-item>
<def-item>
<term>RF</term>
<def>
<p id="Par13">Random Forest</p>
</def>
</def-item>
<def-item>
<term>ROC curve</term>
<def>
<p id="Par14">Receiver operating characteristic curve</p>
</def>
</def-item>
<def-item>
<term>SVM</term>
<def>
<p id="Par15">Support Vector Machine</p>
</def>
</def-item>
</def-list>
</glossary>
<fn-group>
<fn>
<p>
<bold>Publisher’s Note</bold>
</p>
<p>Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>The authors would like to thank Yebei Xu and Ling Yang for suggestions regarding this work.</p>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>BD and XX designed the study. JW, YL, YS, and HH carried out analyses and wrote the program. JW and YL wrote the paper. All authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="funding-information">
<title>Funding</title>
<p>This work was jointly supported by the National Nature Science Foundation of China (61403288, 71871174), the Natural Science Foundation of Hubei Province, China (2019cfb589), and Fundamental Research Funds for the Central Universities (WUT: 2019IA005). The funders had no role in study design, data collection, analysis, decision to publish, or preparation of the manuscript.</p>
</notes>
<notes notes-type="data-availability">
<title>Availability of data and materials</title>
<p>The datasets used to perform the analysis are publicly available at
<ext-link ext-link-type="uri" xlink:href="https://www.gencodegenes.org/mouse/">https://www.gencodegenes.org/mouse/</ext-link>
</p>
</notes>
<notes>
<title>Ethics approval and consent to participate</title>
<p id="Par95">Not applicable.</p>
</notes>
<notes>
<title>Consent for publication</title>
<p id="Par96">Not applicable.</p>
</notes>
<notes notes-type="COI-statement">
<title>Competing interests</title>
<p id="Par97">The authors declare that they have no competing interests.</p>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Djebali</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Davis</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Merkel</surname>
<given-names>A</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Landscape of transcription in human cells</article-title>
<source>Nature.</source>
<year>2012</year>
<volume>489</volume>
<fpage>101</fpage>
<lpage>108</lpage>
<pub-id pub-id-type="doi">10.1038/nature11233</pub-id>
<pub-id pub-id-type="pmid">22955620</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wucher</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Legeai</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Hédan</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<article-title>FEELnc: a tool for long non-coding RNA annotation and its application to the dog transcriptome</article-title>
<source>Nucleic Acids Res</source>
<year>2017</year>
<volume>45</volume>
<issue>8</issue>
<fpage>57</fpage>
<lpage>68</lpage>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Han</surname>
<given-names>SY</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>YC</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Long noncoding RNA identification: comparing machine learning based tools for long noncoding transcripts discrimination</article-title>
<source>Biomed Res Int</source>
<year>2016</year>
<volume>2016</volume>
<fpage>1</fpage>
<lpage>14</lpage>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>WS</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>XW</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>H</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The research progress of LncRNA</article-title>
<source>J Gannan Med Univ</source>
<year>2017</year>
<volume>37</volume>
<issue>3</issue>
<fpage>433</fpage>
<lpage>437</lpage>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Caley</surname>
<given-names>DP</given-names>
</name>
<name>
<surname>Pink</surname>
<given-names>RC</given-names>
</name>
<name>
<surname>Truillano</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Long non-coding RNAs, chromatin and development</article-title>
<source>Sci World J</source>
<year>2010</year>
<volume>8</volume>
<issue>10</issue>
<fpage>90</fpage>
<lpage>102</lpage>
<pub-id pub-id-type="doi">10.1100/tsw.2010.7</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Nagano</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Mitchell</surname>
<given-names>JA</given-names>
</name>
<name>
<surname>Sanz</surname>
<given-names>LA</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The air noncoding RNA epigenetically silences transcription by targeting G9a to chromatin</article-title>
<source>Science.</source>
<year>2008</year>
<volume>322</volume>
<issue>5908</issue>
<fpage>1717</fpage>
<lpage>1720</lpage>
<pub-id pub-id-type="doi">10.1126/science.1163802</pub-id>
<pub-id pub-id-type="pmid">18988810</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Arai</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Induced ncRNAs allosterically modify RNA-binding proteins in cis to inhibit transcription</article-title>
<source>Nature.</source>
<year>2008</year>
<volume>454</volume>
<issue>7200</issue>
<fpage>126</fpage>
<lpage>130</lpage>
<pub-id pub-id-type="doi">10.1038/nature06992</pub-id>
<pub-id pub-id-type="pmid">18509338</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wapinski</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>HY</given-names>
</name>
</person-group>
<article-title>Corrigendum: long noncoding RNAs and human disease</article-title>
<source>Trends Cell Biol</source>
<year>2011</year>
<volume>21</volume>
<issue>6</issue>
<fpage>354</fpage>
<lpage>361</lpage>
<pub-id pub-id-type="doi">10.1016/j.tcb.2011.04.001</pub-id>
<pub-id pub-id-type="pmid">21550244</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kong</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>ZQ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine</article-title>
<source>Nucleic Acids Res</source>
<year>2007</year>
<volume>35</volume>
<fpage>345</fpage>
<lpage>349</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkm391</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Bu</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts</article-title>
<source>Nucleic Acids Res</source>
<year>2013</year>
<volume>41</volume>
<issue>17</issue>
<fpage>166</fpage>
<lpage>173</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkt646</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Dang</surname>
<given-names>HX</given-names>
</name>
</person-group>
<source>Multi-feature based long non-coding RNA recognition method</source>
<year>2013</year>
<publisher-loc>Xian</publisher-loc>
<publisher-name>Xidian University</publisher-name>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mariner</surname>
<given-names>PD</given-names>
</name>
<name>
<surname>Walters</surname>
<given-names>RD</given-names>
</name>
<name>
<surname>Espinoza</surname>
<given-names>CA</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Human Alu RNA is a modular transacting repressor of mRNA transcription during heat shock</article-title>
<source>Mol Cell</source>
<year>2008</year>
<volume>29</volume>
<issue>4</issue>
<fpage>499</fpage>
<lpage>509</lpage>
<pub-id pub-id-type="doi">10.1016/j.molcel.2007.12.013</pub-id>
<pub-id pub-id-type="pmid">18313387</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>MF</given-names>
</name>
<name>
<surname>Jungreis</surname>
<given-names>I</given-names>
</name>
<name>
<surname>Kellis</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions</article-title>
<source>Bioinformatics.</source>
<year>2011</year>
<volume>27</volume>
<issue>13</issue>
<fpage>275</fpage>
<lpage>282</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btr209</pub-id>
<pub-id pub-id-type="pmid">21075743</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lertampaiporn</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Thammarongtham</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Nukoolkit</surname>
<given-names>C</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Identification of non-coding RNAs with a new composite feature in the hybrid random forest ensemble algorithm</article-title>
<source>Nucleic Acids Res</source>
<year>2014</year>
<volume>42</volume>
<issue>11</issue>
<fpage>93</fpage>
<lpage>104</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gku325</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Wei</surname>
<given-names>M</given-names>
</name>
</person-group>
<source>Identification of long non-coding RNA and mRNA based on maximum entropy and k-mer</source>
<year>2015</year>
<publisher-loc>Xian</publisher-loc>
<publisher-name>Xidian University</publisher-name>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qaisar</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Syed</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Azizuddin</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A review of computational methods for finding non-coding rna genes</article-title>
<source>Genes.</source>
<year>2016</year>
<volume>7</volume>
<issue>12</issue>
<fpage>113</fpage>
<pub-id pub-id-type="doi">10.3390/genes7120113</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>X</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Short-term passenger flow prediction under passenger flow control using a dynamic radial basis function network</article-title>
<source>Appl Soft Comput</source>
<year>2019</year>
<volume>83</volume>
<fpage>105620</fpage>
<pub-id pub-id-type="doi">10.1016/j.asoc.2019.105620</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>F</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Air quality data clustering using EPLS method</article-title>
<source>Information Fusion</source>
<year>2017</year>
<volume>7</volume>
<issue>36</issue>
<fpage>225</fpage>
<lpage>232</lpage>
<pub-id pub-id-type="doi">10.1016/j.inffus.2016.11.015</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zeng</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Edwards</surname>
<given-names>MD</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Convolutional neural network architectures for predicting DNA-protein binding</article-title>
<source>Bioinformatics.</source>
<year>2016</year>
<volume>32</volume>
<issue>12</issue>
<fpage>i121</fpage>
<lpage>i127</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btw255</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alipanahi</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Delong</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Weirauch</surname>
<given-names>MT</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning</article-title>
<source>Nat Biotechnol</source>
<year>2015</year>
<volume>33</volume>
<issue>8</issue>
<fpage>831</fpage>
<lpage>838</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.3300</pub-id>
<pub-id pub-id-type="pmid">26213851</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<mixed-citation publication-type="other">Zhang Q, Zhu L, Huang DS. High-order convolutional neural network architecture for predicting DNA-protein binding sites. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(4):1184–92.</mixed-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<mixed-citation publication-type="other">Zhang Q, Zhu L, Bao WZ, et al. Weakly-supervised convolutional neural network architecture for predicting protein-DNA binding. IEEE/ACM Trans Comput Biol Bioinform. 2018. Online ahead of print. 10.1109/TCBB.2018.2864203.</mixed-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>DS</given-names>
</name>
</person-group>
<article-title>WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data</article-title>
<source>Sci Rep</source>
<year>2017</year>
<volume>7</volume>
<issue>1</issue>
<fpage>3217</fpage>
<pub-id pub-id-type="doi">10.1038/s41598-017-03554-7</pub-id>
<pub-id pub-id-type="pmid">28607381</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chuai</surname>
<given-names>GH</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>HH</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>JF</given-names>
</name>
<etal></etal>
</person-group>
<article-title>DeepCRISPR: optimized CRISPR guide RNA design by deep learning</article-title>
<source>Genome Biol</source>
<year>2018</year>
<volume>19</volume>
<issue>1</issue>
<fpage>80</fpage>
<pub-id pub-id-type="doi">10.1186/s13059-018-1459-4</pub-id>
<pub-id pub-id-type="pmid">29945655</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gasri-Plotnitsky</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ovadia</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Shamalov</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A novel lncRNA, GASL1, inhibits cell proliferation and restricts E2F1 activity</article-title>
<source>Oncotarget.</source>
<year>2017</year>
<volume>8</volume>
<issue>14</issue>
<fpage>23775</fpage>
<lpage>23786</lpage>
<pub-id pub-id-type="doi">10.18632/oncotarget.15864</pub-id>
<pub-id pub-id-type="pmid">28423601</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>HB</given-names>
</name>
</person-group>
<article-title>Recent advances in developing web-servers for predicting protein attributes</article-title>
<source>Nat Sci</source>
<year>2009</year>
<volume>1</volume>
<fpage>63</fpage>
<lpage>92</lpage>
</element-citation>
</ref>
<ref id="CR27">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
</person-group>
<article-title>Impacts of bioinformatics to medicinal chemistry</article-title>
<source>Med Chem</source>
<year>2015</year>
<volume>11</volume>
<fpage>218</fpage>
<lpage>234</lpage>
<pub-id pub-id-type="doi">10.2174/1573406411666141229162834</pub-id>
<pub-id pub-id-type="pmid">25548930</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chou</surname>
<given-names>KC</given-names>
</name>
</person-group>
<article-title>An unprecedented revolution in medicinal chemistry driven by the progress of biological science</article-title>
<source>Curr Top Med Chem</source>
<year>2017</year>
<volume>17</volume>
<fpage>2337</fpage>
<lpage>2358</lpage>
<pub-id pub-id-type="doi">10.2174/1568026617666170414145508</pub-id>
<pub-id pub-id-type="pmid">28413951</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>X</given-names>
</name>
</person-group>
<source>Biological classification based on k-mer frequency statistics</source>
<year>2011</year>
<publisher-loc>Changchun</publisher-loc>
<publisher-name>Jilin University</publisher-name>
</element-citation>
</ref>
<ref id="CR30">
<label>30.</label>
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
</person-group>
<source>Statistical learning methods</source>
<year>2012</year>
<publisher-loc>Beijing</publisher-loc>
<publisher-name>Peking University Press</publisher-name>
<fpage>18</fpage>
<lpage>19</lpage>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0002850 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0002850 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021