SARS exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus

Internal identifier: 000391 (Pmc/Corpus); previous: 000390; next: 000392

Authors: Xiao-Li Qiang; Peng Xu; Gang Fang; Wen-Bin Liu; Zheng Kou

Source:

RBID : PMC:7093988

Abstract

Background

Coronaviruses can cross the species barrier and infect humans, causing severe respiratory syndrome. SARS-CoV-2, which potentially originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronaviruses for early warning.

Methods

The spike protein sequences of 2666 coronaviruses were collected from the 2019 Novel Coronavirus Resource (2019nCoVR) database of the China National Genomics Data Center on Jan 29, 2020. A total of 507 human-origin viruses were regarded as positive samples, whereas 2159 non-human-origin viruses were regarded as negative samples. To capture the key information of the spike protein, three feature encoding algorithms (amino acid composition, AAC; parallel correlation-based pseudo-amino-acid composition, PC-PseAAC; and G-gap dipeptide composition, GGAP) were used to train 41 random forest models. The feature with the best performance was then examined with the multidimensional scaling method to explore the pattern of human coronaviruses.

Results

The 10-fold cross-validation results showed that good performance was achieved with the GGAP (g = 3) feature. The predictive model achieved a maximum accuracy (ACC) of 98.18% and a Matthews correlation coefficient (MCC) of 0.9638. Seven clusters of human coronaviruses (229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2) were found. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses use the same human receptor (angiotensin-converting enzyme II). The large gap in the distance curve suggests that the origin of SARS-CoV-2 is not clear and that surveillance in the field should continue. The smooth distance curve for SARS-CoV suggests that its close relatives still exist in nature and continue to challenge public health.

Conclusions

The optimal feature (GGAP, g = 3) performed well in terms of predicting infection risk and could be used to explore the evolutionary dynamic in a simple, fast and large-scale manner. The study may be beneficial for the surveillance of the genome mutation of coronavirus in the field.


Url:
DOI: 10.1186/s40249-020-00649-8
PubMed: 32209118
PubMed Central: 7093988

Links to Exploration step

PMC:7093988

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus</title>
<author>
<name sortKey="Qiang, Xiao Li" sort="Qiang, Xiao Li" uniqKey="Qiang X" first="Xiao-Li" last="Qiang">Xiao-Li Qiang</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xu, Peng" sort="Xu, Peng" uniqKey="Xu P" first="Peng" last="Xu">Peng Xu</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fang, Gang" sort="Fang, Gang" uniqKey="Fang G" first="Gang" last="Fang">Gang Fang</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Wen Bin" sort="Liu, Wen Bin" uniqKey="Liu W" first="Wen-Bin" last="Liu">Wen-Bin Liu</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kou, Zheng" sort="Kou, Zheng" uniqKey="Kou Z" first="Zheng" last="Kou">Zheng Kou</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">32209118</idno>
<idno type="pmc">7093988</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7093988</idno>
<idno type="RBID">PMC:7093988</idno>
<idno type="doi">10.1186/s40249-020-00649-8</idno>
<date when="2020">2020</date>
<idno type="wicri:Area/Pmc/Corpus">000391</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000391</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus</title>
<author>
<name sortKey="Qiang, Xiao Li" sort="Qiang, Xiao Li" uniqKey="Qiang X" first="Xiao-Li" last="Qiang">Xiao-Li Qiang</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xu, Peng" sort="Xu, Peng" uniqKey="Xu P" first="Peng" last="Xu">Peng Xu</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Fang, Gang" sort="Fang, Gang" uniqKey="Fang G" first="Gang" last="Fang">Gang Fang</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Wen Bin" sort="Liu, Wen Bin" uniqKey="Liu W" first="Wen-Bin" last="Liu">Wen-Bin Liu</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Kou, Zheng" sort="Kou, Zheng" uniqKey="Kou Z" first="Zheng" last="Kou">Zheng Kou</name>
<affiliation>
<nlm:aff id="Aff1"></nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Infectious Diseases of Poverty</title>
<idno type="ISSN">2095-5162</idno>
<idno type="eISSN">2049-9957</idno>
<imprint>
<date when="2020">2020</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p id="Par1">Coronaviruses can cross the species barrier and infect humans, causing severe respiratory syndrome. SARS-CoV-2, which potentially originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronaviruses for early warning.</p>
</sec>
<sec>
<title>Methods</title>
<p id="Par2">The spike protein sequences of 2666 coronaviruses were collected from the 2019 Novel Coronavirus Resource (2019nCoVR) database of the China National Genomics Data Center on Jan 29, 2020. A total of 507 human-origin viruses were regarded as positive samples, whereas 2159 non-human-origin viruses were regarded as negative samples. To capture the key information of the spike protein, three feature encoding algorithms (amino acid composition, AAC; parallel correlation-based pseudo-amino-acid composition, PC-PseAAC; and G-gap dipeptide composition, GGAP) were used to train 41 random forest models. The feature with the best performance was then examined with the multidimensional scaling method to explore the pattern of human coronaviruses.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par3">The 10-fold cross-validation results showed that good performance was achieved with the GGAP (g = 3) feature. The predictive model achieved a maximum accuracy (ACC) of 98.18% and a Matthews correlation coefficient (MCC) of 0.9638. Seven clusters of human coronaviruses (229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2) were found. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses use the same human receptor (angiotensin-converting enzyme II). The large gap in the distance curve suggests that the origin of SARS-CoV-2 is not clear and that surveillance in the field should continue. The smooth distance curve for SARS-CoV suggests that its close relatives still exist in nature and continue to challenge public health.</p>
</sec>
<sec>
<title>Conclusions</title>
<p id="Par4">The optimal feature (GGAP, g = 3) performed well in terms of predicting infection risk and could be used to explore the evolutionary dynamic in a simple, fast and large-scale manner. The study may be beneficial for the surveillance of the genome mutation of coronavirus in the field.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Gorbalenya, A" uniqKey="Gorbalenya A">A Gorbalenya</name>
</author>
<author>
<name sortKey="Enjuanes, L" uniqKey="Enjuanes L">L Enjuanes</name>
</author>
<author>
<name sortKey="Ziebuhr, J" uniqKey="Ziebuhr J">J Ziebuhr</name>
</author>
<author>
<name sortKey="Snijder, E" uniqKey="Snijder E">E Snijder</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Corman, V" uniqKey="Corman V">V Corman</name>
</author>
<author>
<name sortKey="Muth, D" uniqKey="Muth D">D Muth</name>
</author>
<author>
<name sortKey="Niemeyer, D" uniqKey="Niemeyer D">D Niemeyer</name>
</author>
<author>
<name sortKey="Drosten, C" uniqKey="Drosten C">C Drosten</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cui, J" uniqKey="Cui J">J Cui</name>
</author>
<author>
<name sortKey="Li, F" uniqKey="Li F">F Li</name>
</author>
<author>
<name sortKey="Shi, Zl" uniqKey="Shi Z">ZL Shi</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Adams, M" uniqKey="Adams M">M Adams</name>
</author>
<author>
<name sortKey="Carstens, E" uniqKey="Carstens E">E Carstens</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Menachery, V" uniqKey="Menachery V">V Menachery</name>
</author>
<author>
<name sortKey="Yount, B" uniqKey="Yount B">B Yount</name>
</author>
<author>
<name sortKey="Debbink, K" uniqKey="Debbink K">K Debbink</name>
</author>
<author>
<name sortKey="Agnihothram, S" uniqKey="Agnihothram S">S Agnihothram</name>
</author>
<author>
<name sortKey="Gralinski, L" uniqKey="Gralinski L">L Gralinski</name>
</author>
<author>
<name sortKey="Plante, J" uniqKey="Plante J">J Plante</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qiang, Xl" uniqKey="Qiang X">XL Qiang</name>
</author>
<author>
<name sortKey="Kou, Z" uniqKey="Kou Z">Z Kou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Qiang, Xl" uniqKey="Qiang X">XL Qiang</name>
</author>
<author>
<name sortKey="Kou, Z" uniqKey="Kou Z">Z Kou</name>
</author>
<author>
<name sortKey="Fang, G" uniqKey="Fang G">G Fang</name>
</author>
<author>
<name sortKey="Wang, Y" uniqKey="Wang Y">Y Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Heald Sargent, T" uniqKey="Heald Sargent T">T Heald-Sargent</name>
</author>
<author>
<name sortKey="Gallagher, T" uniqKey="Gallagher T">T Gallagher</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, Wm" uniqKey="Zhao W">WM Zhao</name>
</author>
<author>
<name sortKey="Song, Sh" uniqKey="Song S">SH Song</name>
</author>
<author>
<name sortKey="Chen, Ml" uniqKey="Chen M">ML Chen</name>
</author>
<author>
<name sortKey="Zou, D" uniqKey="Zou D">D Zou</name>
</author>
<author>
<name sortKey="Ma, Ln" uniqKey="Ma L">LN Ma</name>
</author>
<author>
<name sortKey="Ma, Yk" uniqKey="Ma Y">YK Ma</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B Liu</name>
</author>
<author>
<name sortKey="Liu, F" uniqKey="Liu F">F Liu</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J Chen</name>
</author>
<author>
<name sortKey="Fang, L" uniqKey="Fang L">L Fang</name>
</author>
<author>
<name sortKey="Chou, K" uniqKey="Chou K">K Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Atchley, Wr" uniqKey="Atchley W">WR Atchley</name>
</author>
<author>
<name sortKey="Zhao, J" uniqKey="Zhao J">J Zhao</name>
</author>
<author>
<name sortKey="Fernandes, Ad" uniqKey="Fernandes A">AD Fernandes</name>
</author>
<author>
<name sortKey="Druke, T" uniqKey="Druke T">T Drüke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liaw, A" uniqKey="Liaw A">A Liaw</name>
</author>
<author>
<name sortKey="Wiener, M" uniqKey="Wiener M">M Wiener</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sing, T" uniqKey="Sing T">T Sing</name>
</author>
<author>
<name sortKey="Sander, O" uniqKey="Sander O">O Sander</name>
</author>
<author>
<name sortKey="Beerenwinkel, N" uniqKey="Beerenwinkel N">N Beerenwinkel</name>
</author>
<author>
<name sortKey="Lengauer, T" uniqKey="Lengauer T">T Lengauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Holshue, Ml" uniqKey="Holshue M">ML Holshue</name>
</author>
<author>
<name sortKey="Debolt, C" uniqKey="Debolt C">C DeBolt</name>
</author>
<author>
<name sortKey="Lindquist, S" uniqKey="Lindquist S">S Lindquist</name>
</author>
<author>
<name sortKey="Lofy, Kh" uniqKey="Lofy K">KH Lofy</name>
</author>
<author>
<name sortKey="Wiesman, J" uniqKey="Wiesman J">J Wiesman</name>
</author>
<author>
<name sortKey="Bruce, H" uniqKey="Bruce H">H Bruce</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Infect Dis Poverty</journal-id>
<journal-id journal-id-type="iso-abbrev">Infect Dis Poverty</journal-id>
<journal-title-group>
<journal-title>Infectious Diseases of Poverty</journal-title>
</journal-title-group>
<issn pub-type="ppub">2095-5162</issn>
<issn pub-type="epub">2049-9957</issn>
<publisher>
<publisher-name>BioMed Central</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">32209118</article-id>
<article-id pub-id-type="pmc">7093988</article-id>
<article-id pub-id-type="publisher-id">649</article-id>
<article-id pub-id-type="doi">10.1186/s40249-020-00649-8</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Qiang</surname>
<given-names>Xiao-Li</given-names>
</name>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Xu</surname>
<given-names>Peng</given-names>
</name>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Fang</surname>
<given-names>Gang</given-names>
</name>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Wen-Bin</given-names>
</name>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Kou</surname>
<given-names>Zheng</given-names>
</name>
<address>
<email>kouzhengcn@foxmail.com</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="GRID">grid.411863.9</institution-id>
<institution-id institution-id-type="ISNI">0000 0001 0067 3588</institution-id>
<institution>Institute of Computing Science and Technology, Guangzhou University,</institution>
</institution-wrap>
Guangzhou, 510006 China</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>25</day>
<month>3</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>25</day>
<month>3</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>9</volume>
<elocation-id>33</elocation-id>
<history>
<date date-type="received">
<day>6</day>
<month>2</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>3</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2020</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
. The Creative Commons Public Domain Dedication waiver (
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated in a credit line to the data.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<sec>
<title>Background</title>
<p id="Par1">Coronaviruses can cross the species barrier and infect humans, causing severe respiratory syndrome. SARS-CoV-2, which potentially originated in bats, is still circulating in China. In this study, a prediction model is proposed to evaluate the infection risk of non-human-origin coronaviruses for early warning.</p>
</sec>
<sec>
<title>Methods</title>
<p id="Par2">The spike protein sequences of 2666 coronaviruses were collected from the 2019 Novel Coronavirus Resource (2019nCoVR) database of the China National Genomics Data Center on Jan 29, 2020. A total of 507 human-origin viruses were regarded as positive samples, whereas 2159 non-human-origin viruses were regarded as negative samples. To capture the key information of the spike protein, three feature encoding algorithms (amino acid composition, AAC; parallel correlation-based pseudo-amino-acid composition, PC-PseAAC; and G-gap dipeptide composition, GGAP) were used to train 41 random forest models. The feature with the best performance was then examined with the multidimensional scaling method to explore the pattern of human coronaviruses.</p>
</sec>
<sec>
<title>Results</title>
<p id="Par3">The 10-fold cross-validation results showed that good performance was achieved with the GGAP (g = 3) feature. The predictive model achieved a maximum accuracy (ACC) of 98.18% and a Matthews correlation coefficient (MCC) of 0.9638. Seven clusters of human coronaviruses (229E, NL63, OC43, HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2) were found. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses use the same human receptor (angiotensin-converting enzyme II). The large gap in the distance curve suggests that the origin of SARS-CoV-2 is not clear and that surveillance in the field should continue. The smooth distance curve for SARS-CoV suggests that its close relatives still exist in nature and continue to challenge public health.</p>
</sec>
<sec>
<title>Conclusions</title>
<p id="Par4">The optimal feature (GGAP, g = 3) performed well in terms of predicting infection risk and could be used to explore the evolutionary dynamic in a simple, fast and large-scale manner. The study may be beneficial for the surveillance of the genome mutation of coronavirus in the field.</p>
</sec>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Coronavirus</kwd>
<kwd>Cross-species infection</kwd>
<kwd>Spike protein</kwd>
<kwd>Machine learning</kwd>
</kwd-group>
<funding-group>
<award-group>
<funding-source>
<institution-wrap>
<institution-id institution-id-type="FundRef">http://dx.doi.org/10.13039/501100001809</institution-id>
<institution>National Natural Science Foundation of China</institution>
</institution-wrap>
</funding-source>
<award-id>61972109</award-id>
<principal-award-recipient>
<name>
<surname>Kou</surname>
<given-names>Zheng</given-names>
</name>
</principal-award-recipient>
</award-group>
</funding-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2020</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Background</title>
<p id="Par22">Coronavirus (CoV) belongs to the order Nidovirales and can infect humans, mammals, and birds [
<xref ref-type="bibr" rid="CR1">1</xref>
]. The viral genome consists of positive-sense single-stranded RNA, and its organization varies. The subfamily Coronavirinae is divided into four genera: α, β, γ, and δ [
<xref ref-type="bibr" rid="CR2">2</xref>
]. There are seven human coronaviruses: 229E (α-CoV), NL63 (α-CoV), OC43 (β-CoV), HKU1 (β-CoV), MERS-CoV (β-CoV), SARS-CoV (β-CoV), and SARS-CoV-2 (β-CoV). MERS-CoV, SARS-CoV and SARS-CoV-2 can infect humans and cause severe pneumonia, with many fatal cases [
<xref ref-type="bibr" rid="CR3">3</xref>
]. SARS-CoV caused a worldwide epidemic in which 774 fatal cases were reported [
<xref ref-type="bibr" rid="CR3">3</xref>
]. At the time of writing, SARS-CoV-2 is still circulating in China [
<xref ref-type="bibr" rid="CR4">4</xref>
–
<xref ref-type="bibr" rid="CR6">6</xref>
].</p>
<p id="Par23">As a considerable number of coronaviruses have been isolated from bats and other animals, it is believed that wild animals harbor a viral gene reservoir [
<xref ref-type="bibr" rid="CR7">7</xref>
]. Coronavirus can directly cross the species barrier and infect humans with high fatality [
<xref ref-type="bibr" rid="CR8">8</xref>
]. As the antigen is novel for a human host, public health is being seriously challenged. The infection risk of coronavirus in animals should be analyzed and a prediction model should be constructed for early warning. For this purpose, machine-learning methods appear to be ideal tools [
<xref ref-type="bibr" rid="CR9">9</xref>
,
<xref ref-type="bibr" rid="CR10">10</xref>
]. The spike protein on the surface of the viral particle plays key roles in cell-receptor binding and membrane fusion [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
], by which the host range is firmly determined [
<xref ref-type="bibr" rid="CR8">8</xref>
]. In this study, we screened the features of the spike protein using three encoding algorithms and predicted the cross-species infection of coronaviruses with the random forest method. Moreover, the optimal feature (G-gap dipeptide composition, GGAP, g = 3) was used to explore the evolutionary dynamic in a simple, fast and large-scale manner.</p>
</sec>
<sec id="Sec2">
<title>Methods</title>
<sec id="Sec3">
<title>Dataset</title>
<p id="Par24">The protein sequences of 2666 coronaviruses were collected from the 2019 Novel Coronavirus Resource (2019nCoVR) database of the China National Genomics Data Center (NGDC,
<ext-link ext-link-type="uri" xlink:href="https://bigd.big.ac.cn/ncov">https://bigd.big.ac.cn/ncov</ext-link>
) on Jan 29, 2020 [
<xref ref-type="bibr" rid="CR12">12</xref>
]. These strains had full-length genomes, were isolated between 1941 and 2020, and included SARS-CoV-2 strains. Information on these strains is summarized in Additional file 
<xref rid="MOESM1" ref-type="media">1</xref>
. The 507 human-origin coronaviruses were regarded as positive samples, whereas the 2159 non-human-origin coronaviruses were regarded as negative.</p>
</sec>
<sec id="Sec4">
<title>Feature encoding algorithms</title>
<p id="Par25">To capture the key information of the spike protein, we used three encoding algorithms that cover multiple perspectives, namely compositional information, position-related information and physicochemical properties (Table 
<xref rid="Tab1" ref-type="table">1</xref>
). The feature with the best performance was visualized by the multidimensional scaling method in R (MDS,
<ext-link ext-link-type="uri" xlink:href="https://cran.r-project.org/web/packages/MASS/index.html">https://cran.r-project.org/web/packages/MASS/index.html</ext-link>
). The details of the feature encoding algorithms used to encode the spike protein into feature vectors are listed below.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Summary of feature descriptors</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>No.</th>
<th>Feature type</th>
<th>Dimension</th>
<th>No.</th>
<th>Feature type</th>
<th>Dimension</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>PseAAC (λ = 1)</td>
<td>21</td>
<td>22</td>
<td>GGAP (g = 0)</td>
<td>400</td>
</tr>
<tr>
<td>2</td>
<td>PseAAC (λ = 2)</td>
<td>22</td>
<td>23</td>
<td>GGAP (g = 1)</td>
<td>400</td>
</tr>
<tr>
<td>3</td>
<td>PseAAC (λ = 3)</td>
<td>23</td>
<td>24</td>
<td>GGAP (g = 2)</td>
<td>400</td>
</tr>
<tr>
<td>4</td>
<td>PseAAC (λ = 4)</td>
<td>24</td>
<td>25</td>
<td>GGAP (g = 3)</td>
<td>400</td>
</tr>
<tr>
<td>5</td>
<td>PseAAC (λ = 5)</td>
<td>25</td>
<td>26</td>
<td>GGAP (g = 4)</td>
<td>400</td>
</tr>
<tr>
<td>6</td>
<td>PseAAC (λ = 6)</td>
<td>26</td>
<td>27</td>
<td>GGAP (g = 5)</td>
<td>400</td>
</tr>
<tr>
<td>7</td>
<td>PseAAC (λ = 7)</td>
<td>27</td>
<td>28</td>
<td>GGAP (g = 6)</td>
<td>400</td>
</tr>
<tr>
<td>8</td>
<td>PseAAC (λ = 8)</td>
<td>28</td>
<td>29</td>
<td>GGAP (g = 7)</td>
<td>400</td>
</tr>
<tr>
<td>9</td>
<td>PseAAC (λ = 9)</td>
<td>29</td>
<td>30</td>
<td>GGAP (g = 8)</td>
<td>400</td>
</tr>
<tr>
<td>10</td>
<td>PseAAC (λ = 10)</td>
<td>30</td>
<td>31</td>
<td>GGAP (g = 9)</td>
<td>400</td>
</tr>
<tr>
<td>11</td>
<td>PseAAC (λ = 11)</td>
<td>31</td>
<td>32</td>
<td>GGAP (g = 10)</td>
<td>400</td>
</tr>
<tr>
<td>12</td>
<td>PseAAC (λ = 12)</td>
<td>32</td>
<td>33</td>
<td>GGAP (g = 11)</td>
<td>400</td>
</tr>
<tr>
<td>13</td>
<td>PseAAC (λ = 13)</td>
<td>33</td>
<td>34</td>
<td>GGAP (g = 12)</td>
<td>400</td>
</tr>
<tr>
<td>14</td>
<td>PseAAC (λ = 14)</td>
<td>34</td>
<td>35</td>
<td>GGAP (g = 13)</td>
<td>400</td>
</tr>
<tr>
<td>15</td>
<td>PseAAC (λ = 15)</td>
<td>35</td>
<td>36</td>
<td>GGAP (g = 14)</td>
<td>400</td>
</tr>
<tr>
<td>16</td>
<td>PseAAC (λ = 16)</td>
<td>36</td>
<td>37</td>
<td>GGAP (g = 15)</td>
<td>400</td>
</tr>
<tr>
<td>17</td>
<td>PseAAC (λ = 17)</td>
<td>37</td>
<td>38</td>
<td>GGAP (g = 16)</td>
<td>400</td>
</tr>
<tr>
<td>18</td>
<td>PseAAC (λ = 18)</td>
<td>38</td>
<td>39</td>
<td>GGAP (g = 17)</td>
<td>400</td>
</tr>
<tr>
<td>19</td>
<td>PseAAC (λ = 19)</td>
<td>39</td>
<td>40</td>
<td>GGAP (g = 18)</td>
<td>400</td>
</tr>
<tr>
<td>20</td>
<td>PseAAC (λ = 20)</td>
<td>40</td>
<td>41</td>
<td>GGAP (g = 19)</td>
<td>400</td>
</tr>
<tr>
<td>21</td>
<td>AAC</td>
<td>20</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<italic>GGAP</italic>
G-gap dipeptide composition,
<italic>PseAAC</italic>
Pseudo-amino-acid composition,
<italic>AAC</italic>
Amino acid composition</p>
</table-wrap-foot>
</table-wrap>
</p>
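<p>The 41 descriptors in Table 1 can also be enumerated programmatically. The following Python sketch is illustrative only and is not the authors' code: the function name feature_catalog and its parameters are hypothetical, and the PseAAC dimension is assumed to be 20 + λ, which matches the dimensions listed in the table.</p>
<preformat>
```python
def feature_catalog(max_lambda=20, max_gap=19):
    """Return (number, feature type, dimension) tuples in the order of Table 1.

    Hypothetical helper: the name and parameters are assumptions made for
    this illustration, not part of the published method.
    """
    catalog = []
    # PseAAC: 20 composition terms plus lambda correlation terms (assumed 20 + lambda).
    for lam in range(1, max_lambda + 1):
        catalog.append((len(catalog) + 1, f"PseAAC (lambda = {lam})", 20 + lam))
    # AAC: one frequency per amino acid type.
    catalog.append((len(catalog) + 1, "AAC", 20))
    # GGAP: 20 x 20 ordered residue pairs for every gap value g.
    for gap in range(0, max_gap + 1):
        catalog.append((len(catalog) + 1, f"GGAP (g = {gap})", 20 * 20))
    return catalog


catalog = feature_catalog()
print(len(catalog))   # 41 descriptors in total
print(catalog[24])    # (25, 'GGAP (g = 3)', 400)
```
</preformat>
<p>Under these assumptions, the catalogue holds 20 PseAAC settings, one AAC setting and 20 GGAP settings, that is, one descriptor for each of the 41 random forest models.</p>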
<sec id="Sec5">
<title>Amino acid composition</title>
<p id="Par26">Amino acid composition (AAC) is a simple but commonly used feature descriptor for sequence analysis and model construction. For a total of 20 amino acid types, the AAC descriptor calculates the frequency of each type of amino acid. For example, if the amino acid type i occurs n
<sub>i</sub>
times in the protein sequence, then the frequency of i is denoted by f(i) = n
<sub>i</sub>
/L, where L is the protein length. For a given strain, we obtained a 20-dimensional feature vector by computing the frequencies of the 20 amino acids.</p>
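<p>As a minimal illustration of the AAC descriptor, the Python sketch below computes f(i) = n<sub>i</sub>/L for each of the 20 amino acid types. It is not the authors' implementation: the function name aac, the alphabetical ordering of the one-letter residue codes and the toy sequence are assumptions made for this example. L is taken as the full sequence length, so any non-standard symbols in a sequence would make the 20 frequencies sum to less than 1.</p>
<preformat>
```python
from collections import Counter

# One-letter codes of the 20 standard amino acids (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the 20-dimensional amino acid composition of a protein sequence.

    Each entry is f(i) = n_i / L, where n_i counts residue i and L is the
    sequence length. Hypothetical helper written for illustration only.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    length = len(seq)
    return [counts[residue] / length for residue in AMINO_ACIDS]


# Toy sequence, not a real spike protein: three A, one C and one D.
vector = aac("AACDA")
print(len(vector))   # 20
print(vector[:3])    # [0.6, 0.2, 0.2]
```
</preformat>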
</sec>
<sec id="Sec6">
<title>Parallel correlation-based pseudo-amino-acid composition</title>
<p id="Par27">Parallel correlation-based pseudo-amino-acid composition (PC-PseAAC) measures the parallel correlation between any two amino acids in a protein sequence [
<xref ref-type="bibr" rid="CR13">13</xref>
]. For a given strain P, the PC-PseAAC feature vector is represented by
<disp-formula id="Equa">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ PC\text{-}PseAAC={\left[{fv}_1,\dots, {fv}_{20},{fv}_{20+1},\dots, {fv}_{20+\uplambda}\right]}^T $$\end{document}</tex-math>
<mml:math id="M2" display="block">
<mml:mi mathvariant="italic">PC</mml:mi>
<mml:mo>-</mml:mo>
<mml:mtext mathvariant="italic">PseAAC</mml:mtext>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mfenced close="]" open="[" separators=",,,,,">
<mml:msub>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mn>20</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mrow>
<mml:mn>20</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mrow>
<mml:mn>20</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">λ</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:math>
<graphic xlink:href="40249_2020_649_Article_Equa.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<disp-formula id="Equb">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {fv}_u=\left\{\begin{array}{c}\frac{f_u}{\sum_{i=1}^{20}{f}_i+w{\sum}_{j=1}^{\uplambda}{\theta}_j},1\le u\le 20\\ {}\frac{w{\theta}_{u-20}}{\sum_{i=1}^{20}{f}_i+w{\sum}_{j=1}^{\uplambda}{\theta}_j},20+1\le u\le 20+\uplambda \end{array}\right. $$\end{document}</tex-math>
<mml:math id="M4" display="block">
<mml:msub>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfenced open="{">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mfrac>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>u</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mn>20</mml:mn>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi mathvariant="normal">λ</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo></mml:mo>
<mml:mi>u</mml:mi>
<mml:mo></mml:mo>
<mml:mn>20</mml:mn>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mfrac>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo></mml:mo>
<mml:mn>20</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mn>20</mml:mn>
</mml:msubsup>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>+</mml:mo>
<mml:mi>w</mml:mi>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi mathvariant="normal">λ</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mn>20</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="normal">λ</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
</mml:math>
<graphic xlink:href="40249_2020_649_Article_Equb.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where u is an integer; fv
<sub>u</sub>
(1 ≤ u ≤ 20) represents the normalized occurrence frequency of the 20 amino acids in the spike protein P; λ represents the highest tier of correlation along P; and θ<sub>j</sub> (j = 1, 2, ..., λ) is the j-tier correlation factor, which measures the sequence-order correlation between all residue pairs separated by j positions along P. θ<sub>j</sub> is calculated using the following formula:
<disp-formula id="Equc">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\theta}_j=\frac{1}{L}\ \sum \limits_{i=1}^L\frac{1}{5}\sum \limits_{m=1}^5{\left[{H}_m\left({P}_{i+j}\right)-{H}_m\left({P}_i\right)\right]}^2 $$\end{document}</tex-math>
<mml:math id="M6" display="block">
<mml:msub>
<mml:mi>θ</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>L</mml:mi>
</mml:mfrac>
<mml:mspace width="0.25em"></mml:mspace>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>L</mml:mi>
</mml:munderover>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>5</mml:mn>
</mml:mfrac>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mn>5</mml:mn>
</mml:munderover>
<mml:msup>
<mml:mfenced close="]" open="[">
<mml:mrow>
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mfenced>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mfenced close=")" open="(">
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:math>
<graphic xlink:href="40249_2020_649_Article_Equc.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where H<sub>m</sub>(P<sub>i</sub>) (m = 1, 2, 3, 4, 5) represents the polarity, secondary structure, molecular volume, codon diversity, and electrostatic charge, respectively, of the i-th amino acid P<sub>i</sub> in the protein sequence P [
<xref ref-type="bibr" rid="CR14">14</xref>
]. If i + j > L, the index wraps around, that is, i + j is replaced by i + j − L.</p>
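<p>The two formulas above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study: the function names and the <monospace>properties</monospace> argument (a user-supplied table that maps each residue to its list of physicochemical values H<sub>m</sub>) are assumptions, and the sketch expects sequences that contain only the 20 standard residues.</p>
<preformat>
```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def theta(sequence, j, properties):
    """j-tier correlation factor with wrap-around indexing.

    `properties` is a hypothetical lookup table: residue -> list of
    physicochemical values (five per residue in the paper).
    """
    length = len(sequence)
    total = 0.0
    for i in range(length):
        a = properties[sequence[i]]
        # (i + j) % length implements "i + j - L" when the index runs past L.
        b = properties[sequence[(i + j) % length]]
        total += sum((y - x) ** 2 for x, y in zip(a, b)) / len(a)
    return total / length


def pc_pseaac(sequence, lam, w, properties):
    """Return the 20 + lam dimensional PC-PseAAC vector for weight factor w."""
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    thetas = [theta(sequence, j, properties) for j in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(thetas)
    # First 20 entries: composition terms; last lam entries: correlation terms.
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]
```
</preformat>
<p>By construction the entries of the returned vector sum to 1, because the composition terms and the weighted correlation terms share the same denominator.</p>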
</sec>
<sec id="Sec7">
<title>G-gap dipeptide composition</title>
<p id="Par28">The G-gap dipeptide composition (GGAP) captures the dipeptide composition together with the local order information of any two residues separated by a gap of g positions within the spike sequence. It is formulated as
<disp-formula id="Equd">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ GGAP(g)=\left({fv}_1^g,{fv}_2^g,\dots, {fv}_{400}^g\right) $$\end{document}</tex-math>
<mml:math id="M8" display="block">
<mml:mtext mathvariant="italic">GGAP</mml:mtext>
<mml:mfenced close=")" open="(">
<mml:mi>g</mml:mi>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfenced close=")" open="(" separators=",,,">
<mml:msubsup>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mn>1</mml:mn>
<mml:mi>g</mml:mi>
</mml:msubsup>
<mml:msubsup>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mn>2</mml:mn>
<mml:mi>g</mml:mi>
</mml:msubsup>
<mml:mo>…</mml:mo>
<mml:msubsup>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mn>400</mml:mn>
<mml:mi>g</mml:mi>
</mml:msubsup>
</mml:mfenced>
</mml:math>
<graphic xlink:href="40249_2020_649_Article_Equd.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {fv}_i^g $$\end{document}</tex-math>
<mml:math id="M10" display="inline">
<mml:msubsup>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="40249_2020_649_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the occurrence frequency of the i-th (i = 1, 2, ..., 400) G-gap dipeptide, which is computed as
<disp-formula id="Eque">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {fv}_i^g=\frac{O_i^g}{\sum_{i=1}^{400}{O}_i^g} $$\end{document}</tex-math>
<mml:math id="M12" display="block">
<mml:msubsup>
<mml:mi mathvariant="italic">fv</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:msubsup>
<mml:mi>O</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
</mml:msubsup>
<mml:mrow>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mn>400</mml:mn>
</mml:msubsup>
<mml:msubsup>
<mml:mi>O</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:mfrac>
</mml:math>
<graphic xlink:href="40249_2020_649_Article_Eque.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {O}_i^g $$\end{document}</tex-math>
<mml:math id="M14" display="inline">
<mml:msubsup>
<mml:mi>O</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>g</mml:mi>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="40249_2020_649_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
represents the number of occurrences of the i-th G-gap dipeptide in the spike protein. The dimension of the GGAP feature vector is 20 × 20 = 400.</p>
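<p>The GGAP encoding can be sketched in a few lines of Python. This is an illustrative sketch, not the released implementation. It assumes the common convention that a g-gap dipeptide pairs residue i with residue i + g + 1, so that g = 0 gives ordinary adjacent dipeptides; the function name and the handling of non-standard residues are likewise assumptions.</p>
<preformat>
```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def ggap_features(sequence, g):
    """Return the 400 g-gap dipeptide frequencies of a protein sequence.

    Assumption: a g-gap dipeptide is the pair (P[i], P[i + g + 1]).
    Pairs that contain a non-standard residue (for example X) are skipped.
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    seq = sequence.upper()
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    # Normalize counts so that the 400 frequencies sum to 1.
    return [counts[d] / total for d in DIPEPTIDES]
```
</preformat>
<p>Because the vector always has 400 entries regardless of sequence length, spike proteins of different lengths can be compared directly without alignment.</p>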
</sec>
</sec>
<sec id="Sec8">
<title>Machine learning</title>
<p id="Par29">The framework for the overall prediction is shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
. The framework comprises two main steps: feature representation and machine learning. First, feature representations are computed with the three feature descriptors described above. Second, the random forest (RF) method is used to train and test the prediction models.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Schematic framework of machine learning. First, feature representations from three feature descriptors are obtained. Second, the RF method is used to train and test the dataset and make predictions for cross-species transmission of coronavirus. NGDC: National Genomics Data Center; AAC: Amino acid composition; PC-PseAAC: Parallel correlation-based pseudo-amino-acid composition; GGAP: G-gap dipeptide composition; RF: Random forest</p>
</caption>
<graphic xlink:href="40249_2020_649_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p id="Par30">Because it is robust and performs well, the RF has been widely used to model biological data. In this study, the RF algorithm is used to construct models and make predictions for the cross-species transmission of coronavirus. The RF is an ensemble algorithm that builds a set of decision trees, each grown on a bootstrap sample of the data using a random subset of features. The final prediction for each sample aggregates the predictions of the individual trees, for example by averaging them. In this study, the RF algorithm in the R environment was used [
<xref ref-type="bibr" rid="CR15">15</xref>
]. All the experiments in the study were conducted under R 3.5.0 with default parameters (tree number = 500). To reduce the bias caused by the unbalanced class sizes, the positive samples were increased fourfold by directly duplicating their protein sequences. The 10-fold cross-validation method was used to evaluate the predictive performance. Platt scaling was used to transform the output of the RF model into a probability over the two classes, which was then used to evaluate the infection risk of coronaviruses.</p>
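<p>The effect of the fourfold duplication on the class balance can be illustrated with a short Python sketch. The study itself used R; the function name and the integer stand-ins for feature vectors below are assumptions made only for illustration.</p>
<preformat>
```python
def oversample_positives(samples, labels, factor=4, positive_label=1):
    """Return copies of the data in which each positive sample appears `factor` times."""
    out_samples, out_labels = [], []
    for sample, label in zip(samples, labels):
        repeats = factor if label == positive_label else 1
        out_samples.extend([sample] * repeats)
        out_labels.extend([label] * repeats)
    return out_samples, out_labels


# Class sizes reported in the study: 507 human-origin, 2159 non-human-origin.
labels = [1] * 507 + [0] * 2159
samples = list(range(len(labels)))  # stand-ins for the real feature vectors
_, balanced = oversample_positives(samples, labels)
print(balanced.count(1), balanced.count(0))  # 2028 2159
```
</preformat>
<p>After duplication the two classes are of similar size (2028 positive versus 2159 negative samples), which matches the row totals TP + FN and TN + FP in Table 2.</p>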
</sec>
<sec id="Sec9">
<title>Performance evaluation metrics</title>
<p id="Par31">Four commonly used metrics for model performance evaluation, that is, sensitivity (SN), specificity (SP), accuracy (ACC) and Matthews correlation coefficient (MCC), were used in the study. The details are listed as follows:
<disp-formula id="Equf">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \left\{\begin{array}{c} SN=\frac{TP}{TP+ FN}\times 100\%\\ {} SP=\frac{TN}{TN+ FP}\times 100\%\\ {} ACC=\frac{TP+ TN}{TP+ TN+ FP+ FN}\times 100\%\\ {} MCC=\frac{TP\times TN- FP\times FN}{\sqrt{\left( TP+ FN\right)\ \left( TP+ FP\right)\ \left( TN+ FN\right)\ \left( TN+ FP\right)}}\end{array}\right. $$\end{document}</tex-math>
<mml:math id="M16" display="block">
<mml:mfenced open="{">
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mi mathvariant="italic">SN</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>×</mml:mo>
<mml:mn>100</mml:mn>
<mml:mo>%</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mi mathvariant="italic">SP</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mrow>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>×</mml:mo>
<mml:mn>100</mml:mn>
<mml:mo>%</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mi mathvariant="italic">ACC</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">TN</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mo>×</mml:mo>
<mml:mn>100</mml:mn>
<mml:mo>%</mml:mo>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:maligngroup></mml:maligngroup>
<mml:mi mathvariant="italic">MCC</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
<mml:mo>×</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
<mml:msqrt>
<mml:mrow>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mi mathvariant="italic">TP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FN</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mspace width="0.25em"></mml:mspace>
<mml:mfenced close=")" open="(">
<mml:mrow>
<mml:mi mathvariant="italic">TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">FP</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msqrt>
</mml:mfrac>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mfenced>
</mml:math>
<graphic xlink:href="40249_2020_649_Article_Equf.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where TP (true positive) is the number of strains with the phenotype of cross-species transmission that are correctly predicted as such; TN (true negative) is the number of strains without the phenotype that are correctly predicted as such; FP (false positive) is the number of strains without the phenotype that are predicted to have it; and FN (false negative) is the number of strains with the phenotype that are predicted not to have it. The SN and SP metrics measure the predictive ability of the model for positive and negative cases, respectively. The other two measures, ACC and MCC, evaluate the overall performance of the model. For all four metrics, a higher score indicates better model performance.</p>
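<p>As a worked check, the four metrics can be computed from a confusion matrix with a few lines of Python. The sketch uses the standard definition of MCC, whose numerator is TP × TN − FP × FN, and takes the counts reported for the GGAP (g = 3) feature in Table 2 as input; the function name is an assumption.</p>
<preformat>
```python
from math import sqrt


def evaluate(tp, tn, fp, fn):
    """Return SN, SP and ACC in percent, and MCC, from confusion-matrix counts."""
    sn = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Counts for GGAP (g = 3) as listed in Table 2.
scores = evaluate(tp=2011, tn=2100, fp=59, fn=17)
print({name: round(value, 4) for name, value in scores.items()})
```
</preformat>
<p>With these counts the sketch gives SN ≈ 99.16, SP ≈ 97.27, ACC ≈ 98.18 and MCC ≈ 0.9639, which agree with the values in Table 2 to within the last reported digit.</p>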
<p id="Par32">In this study, we also used the receiver operating characteristic (ROC) curve to evaluate the overall performance of the binary classifier [
<xref ref-type="bibr" rid="CR16">16</xref>
]. It is generated by plotting the true positive rate (TPR) against the false positive rate (FPR) under different classification thresholds. TPR is also known as sensitivity, as described above, whereas FPR is calculated as 1 − specificity.</p>
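<p>The construction of the ROC curve can be sketched as follows: each distinct predicted probability is used in turn as the classification threshold, and the resulting FPR and TPR are recorded. The function names are assumptions, and the area under the curve is computed with the trapezoidal rule as one common summary, which the study does not report.</p>
<preformat>
```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score.

    `labels` holds 1 for positive and 0 for negative samples; both classes
    must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```
</preformat>
<p>A classifier that ranks every positive sample above every negative sample yields a curve that passes through the point (0, 1) and an area of 1.</p>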
</sec>
</sec>
<sec id="Sec10">
<title>Results</title>
<sec id="Sec11">
<title>Screening of the optimal feature</title>
<p id="Par33">As described in the section Feature encoding algorithms, we used three feature encoding algorithms that capture compositional information, position-related information, and physicochemical properties. A total of 41 features were used to train the prediction models, as shown in Table
<xref rid="Tab1" ref-type="table">1</xref>
. Performance differed among the protein features; the prediction results for the best-performing feature of each type are shown in Table 
<xref rid="Tab2" ref-type="table">2</xref>
. As shown in Table
<xref rid="Tab2" ref-type="table">2</xref>
and Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
a, the predictive model achieved the maximum ACC of 98.18% and an MCC of 0.9638 when the feature GGAP (g = 3) was selected. Across the three feature types, performance varied from 96.15 to 98.18% for ACC and from 0.9243 to 0.9638 for MCC. This indicates that the GGAP feature with g = 3 had the best ability to distinguish coronaviruses with different phenotypes of cross-species transmission. For the ROC curves shown in Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
b, the feature GGAP (g = 3) also performed better than the other features (PC-PseAAC or AAC). The optimal GGAP feature representation could therefore be used to monitor the evolutionary dynamics of coronavirus.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Results of feature representations</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Feature</th>
<th>ACC</th>
<th>SN</th>
<th>SP</th>
<th>MCC</th>
<th>TP</th>
<th>TN</th>
<th>FP</th>
<th>FN</th>
</tr>
</thead>
<tbody>
<tr>
<td>GGAP (g = 3)</td>
<td>98.18</td>
<td>99.16</td>
<td>97.26</td>
<td>0.9638</td>
<td>2011</td>
<td>2100</td>
<td>59</td>
<td>17</td>
</tr>
<tr>
<td>PC-PseAAC (λ = 2)</td>
<td>96.36</td>
<td>98.61</td>
<td>94.25</td>
<td>0.9284</td>
<td>2000</td>
<td>2035</td>
<td>124</td>
<td>28</td>
</tr>
<tr>
<td>AAC</td>
<td>96.15</td>
<td>98.61</td>
<td>93.83</td>
<td>0.9243</td>
<td>2000</td>
<td>2026</td>
<td>133</td>
<td>28</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<italic>ACC</italic>
Accuracy,
<italic>SN</italic>
Sensitivity,
<italic>SP</italic>
Specificity,
<italic>MCC</italic>
Matthews correlation coefficient,
<italic>TP</italic>
True positive,
<italic>TN</italic>
True negative,
<italic>FP</italic>
False positive,
<italic>FN</italic>
False negative,
<italic>GGAP</italic>
G-gap dipeptide composition,
<italic>PC-PseAAC</italic>
Parallel correlation-based pseudo-amino-acid composition,
<italic>AAC</italic>
Amino acid composition</p>
</table-wrap-foot>
</table-wrap>
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Predictive performance of feature representations.
<bold>a</bold>
Ten-fold cross-validation results.
<bold>b</bold>
Receiver operating characteristic curves generated by plotting the true positive rate (TPR) against the false positive rate (FPR) under different classification thresholds. ACC: Accuracy; SN: Sensitivity; SP: Specificity; MCC: Matthews correlation coefficient; AAC: Amino acid composition; GGAP: G-gap dipeptide composition; PC-PseAAC: Parallel correlation-based pseudo-amino-acid composition</p>
</caption>
<graphic xlink:href="40249_2020_649_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
</sec>
<sec id="Sec12">
<title>Patterns of human coronavirus</title>
<p id="Par34">As shown in Table
<xref rid="Tab2" ref-type="table">2</xref>
and Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
, the GGAP (g = 3) feature had the best performance and is proposed for monitoring the evolutionary dynamics of coronavirus. The features of the 507 human samples in our dataset were used to show the patterns with the multidimensional scaling method. Seven clusters, for 229E (α-CoV), NL63 (α-CoV), OC43 (β-CoV), HKU1 (β-CoV), MERS-CoV (β-CoV), SARS-CoV (β-CoV), and SARS-CoV-2 (β-CoV), were clearly formed (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
). The clusters for 229E and NL63 were close to each other and located in the upper right of the figure. The cluster for SARS-CoV-2 was very close to that for SARS-CoV, which suggests that both viruses have the same human receptor (angiotensin converting enzyme II, ACE2). The two clusters for MERS-CoV and OC43 were far away from those for SARS-CoV and SARS-CoV-2.
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Patterns of human coronavirus clustered using the multidimensional scaling method. The x and y coordinates denote the first main factor and second main factor, respectively. SARS-CoV-2 is indicated by the blue solid circle</p>
</caption>
<graphic xlink:href="40249_2020_649_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
</sec>
<sec id="Sec13">
<title>Evolutionary dynamics of SARS-CoV and SARS-CoV-2</title>
<p id="Par35">The optimal GGAP feature performed well in predicting infection risk and was therefore used to explore evolutionary dynamics in a simple, fast, and large-scale manner. Based on the GGAP (g = 3) feature, we separately computed the Euclidean distances of SARS-CoV-2 and SARS-CoV from the other coronaviruses in the dataset. As shown in Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
a, the distance curve between SARS-CoV-2 and other coronaviruses had two gaps. The ‘big’ gap, spanning distances from 0 to 0.02, suggests that SARS-CoV-2 has no close relatives among the other isolated coronaviruses. As shown in Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
b, the distance curve between SARS-CoV and other coronaviruses also had a gap at 0.03, similar to that of SARS-CoV-2. The two gaps at 0.03 suggest that the coronaviruses close to SARS-CoV-2 or SARS-CoV form a separate group. We further checked the coronaviruses close to SARS-CoV-2 and SARS-CoV (distance below 0.03) and found that these close relatives were the same. The results were similar to those from the MDS method and confirmed that SARS-CoV-2 and SARS-CoV have the same origin. Moreover, the big gap at 0.02 suggests that the origin of SARS-CoV-2 is not clear and that field surveillance should continue. The smooth curve for SARS-CoV shows that its close relatives still exist in nature and continue to challenge public health.
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Evolutionary dynamic of SARS-CoV-2 and SARS-CoV.
<bold>a</bold>
Euclidean distance between SARS-CoV-2 and other coronaviruses in the dataset.
<bold>b</bold>
Euclidean distance between SARS-CoV and other coronaviruses in the dataset. The x and y coordinates denote the strain number and Euclidean distance based on the GGAP (g = 3) feature, respectively</p>
</caption>
<graphic xlink:href="40249_2020_649_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
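<p>The distance analysis can be sketched in Python as follows. The sketch assumes that each strain is already represented by its GGAP (g = 3) feature vector; the function names are illustrative and the code is not taken from the released tool.</p>
<preformat>
```python
from math import dist  # Euclidean distance, available from Python 3.8


def sorted_distances(reference, others):
    """Euclidean distances from `reference` to every vector in `others`, ascending."""
    return sorted(dist(reference, vector) for vector in others)


def largest_gap(distances):
    """Return (lower, upper) for the widest jump between consecutive sorted distances."""
    pairs = list(zip(distances, distances[1:]))
    return max(pairs, key=lambda pair: pair[1] - pair[0])
```
</preformat>
<p>Plotting the sorted distances against the strain index gives a distance curve of the kind shown in Fig. 4, and a wide jump between consecutive values marks a gap in which no sampled relative lies.</p>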
</sec>
<sec id="Sec14">
<title>Implementation of the prediction tool</title>
<p id="Par36">We used the Python language to establish an easy-to-use tool that implements our predictor, which is freely accessible via
<ext-link ext-link-type="uri" xlink:href="https://github.com/kouzheng/CovPred-FL">https://github.com/kouzheng/CovPred-FL</ext-link>
and can process large numbers of sequences simply and quickly. For the convenience of researchers, we provide guidelines on how to use the tool to obtain the desired results: (1) Users need to prepare the query sequences in the FASTA format. Examples of FASTA formatted sequences can be found in the repository mentioned above. (2) Users need to input the name of the query file and set the confidence parameter before running predictions. The prediction confidence ranges from 0.0 to 0.5. The lower the confidence set by users, the more sensitive the predictions. The predicted label ‘H’ indicates the phenotype of cross-species transmission, whereas the label ‘N’ indicates its absence. The probability for infection risk is also listed in the result file. A file containing the features of the query sequences is also created to facilitate further analysis.</p>
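<p>For readers who wish to inspect their query file before running the tool, the FASTA format can be read with a short Python function. This is a generic sketch and not part of the released tool; the function name is an assumption.</p>
<preformat>
```python
def parse_fasta(lines):
    """Parse FASTA-formatted text lines into a {header: sequence} dictionary.

    `lines` can be an open file handle or any iterable of strings.
    """
    records = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; the header is the text after ">".
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}
```
</preformat>
<p>Passing an open file handle to the function returns one entry per sequence, with sequence lines that were wrapped over several lines joined into a single string.</p>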
</sec>
</sec>
<sec id="Sec15">
<title>Discussion</title>
<p id="Par37">At present, SARS-CoV-2 is still circulating in China and the epidemic is causing widespread concern worldwide [
<xref ref-type="bibr" rid="CR17">17</xref>
,
<xref ref-type="bibr" rid="CR18">18</xref>
]. As a considerable number of coronaviruses have been isolated from bats and other animals, it is believed that there is a viral gene reservoir in wild animals [
<xref ref-type="bibr" rid="CR7">7</xref>
]. Coronavirus can directly cross the species barrier and infect humans with a severe syndrome [
<xref ref-type="bibr" rid="CR8">8</xref>
]. Because such a virus presents an antigen that is novel to the human host, it poses a serious challenge to public health. In this study, the viral spike protein was used to analyze the infection risk of non-human-origin coronaviruses, and a prediction model was constructed for early warning to prevent disease.</p>
<p id="Par38">The spike protein on the surface of the viral particle plays key roles in the binding of the cell receptor and membrane fusion [
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR11">11</xref>
], by which the host range is firmly determined [
<xref ref-type="bibr" rid="CR8">8</xref>
]. In this study, we chose the spike protein as the target for predicting the cross-species infection of coronaviruses using the RF method. The spike proteins of coronaviruses differ in sequence length, and sequence identity between distant relatives is very low, which complicates alignment and challenges the algorithms used to model biological data. To allow simple, fast, and large-scale analysis and modeling, we used three feature encoding algorithms that capture compositional information, position-related information, and physicochemical properties. Computing these protein features does not require multiple sequence alignment, which reduces the computational complexity.</p>
<p id="Par39">A total of 41 features were used to train the prediction models. The best predictive model achieved the maximum ACC of 98.18% coupled with the MCC of 0.9638 when the feature GGAP (g = 3) was selected, which indicated that the feature GGAP with parameter 3 had the optimal representation ability to distinguish coronaviruses with different phenotypes of cross-species transmission. As shown in Table
<xref rid="Tab2" ref-type="table">2</xref>
, the number of false positives was 59. The false positives may reflect sporadic infections by coronaviruses of animal origin or conflicting descriptions of the ability to bind the human receptor. With improved annotation in the database, the false-positive rate could be reduced [
<xref ref-type="bibr" rid="CR19">19</xref>
].</p>
<p id="Par40">The MDS results were similar to those from traditional evolution analysis [
<xref ref-type="bibr" rid="CR1">1</xref>
,
<xref ref-type="bibr" rid="CR3">3</xref>
,
<xref ref-type="bibr" rid="CR5">5</xref>
], which confirmed that selecting the GGAP (g = 3) feature was reasonable for the prediction of cross-species transmission. Moreover, we computed the Euclidean distances of SARS-CoV-2 and SARS-CoV from the other coronaviruses in the dataset to explore the evolutionary dynamics. The big gap of 0.02 suggests that the origin of SARS-CoV-2 is not clear and that field surveillance should continue. Thanks to considerable recent work on molecular epidemiology in the field, more than 2000 genome sequences of coronaviruses isolated from animals have been identified. In addition to various bat species, other animals should be suspected as direct hosts for SARS-CoV-2. The smooth curve for SARS-CoV indicates that its close relatives still exist in nature and continue to challenge public health.</p>
<p id="Par41">Although many proteins contribute to the procedure of virus production and host invasion, the spike protein is the most important factor to determine host range [
<xref ref-type="bibr" rid="CR8">8</xref>
,
<xref ref-type="bibr" rid="CR19">19</xref>
,
<xref ref-type="bibr" rid="CR20">20</xref>
]. Longer sequences from the viral genome should be considered in further studies to improve the performance of the prediction model. However, applying the algorithm to data with about 30 000 dimensions and a small number of samples will be a challenge. In this study, the infection risk of non-human-origin coronaviruses was evaluated for early warning, and good performance was achieved. The main limitation is that only viral spike proteins were used to build the prediction model; social factors, such as traffic conditions, population size, and citizens’ habits in daily life, were not considered. Although high risk can be predicted from the perspective of the pathogen, a comprehensive assessment should be used to prevent disease in the future.</p>
</sec>
<sec id="Sec16">
<title>Conclusions</title>
<p id="Par42">In this paper, we presented a predictor for identifying the transmission phenotype of coronavirus. The major contribution of this predictor is that a set of informative features of viral proteins, drawn from 41 feature descriptors covering compositional, position-specific, and physicochemical information, was learned using a machine learning algorithm. The 10-fold cross-validation results showed that good performance was achieved with the GGAP (g = 3) feature. The optimal feature performed well in predicting infection risk and was used to explore evolutionary dynamics in a simple, fast, and large-scale manner. This study may be beneficial for coronavirus surveillance and for future studies on the cross-species transmission of coronavirus.</p>
</sec>
<sec sec-type="supplementary-material">
<title>Supplementary information</title>
<sec id="Sec17">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="40249_2020_649_MOESM1_ESM.xlsx">
<caption>
<p>
<bold>Additional file 1.</bold>
</p>
</caption>
</media>
</supplementary-material>
<supplementary-material content-type="local-data" id="MOESM2">
<media xlink:href="40249_2020_649_MOESM2_ESM.xlsx">
<caption>
<p>
<bold>Additional file 2.</bold>
</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<glossary>
<title>Abbreviations</title>
<def-list>
<def-item>
<term>AAC</term>
<def>
<p id="Par5">Amino acid composition</p>
</def>
</def-item>
<def-item>
<term>ACC</term>
<def>
<p id="Par6">Accuracy</p>
</def>
</def-item>
<def-item>
<term>CoV</term>
<def>
<p id="Par7">Coronavirus</p>
</def>
</def-item>
<def-item>
<term>FPR</term>
<def>
<p id="Par8">False positive rate</p>
</def>
</def-item>
<def-item>
<term>GGAP</term>
<def>
<p id="Par9">G-gap dipeptide composition</p>
</def>
</def-item>
<def-item>
<term>MCC</term>
<def>
<p id="Par10">Matthews correlation coefficient</p>
</def>
</def-item>
<def-item>
<term>MDS</term>
<def>
<p id="Par11">Multidimensional scaling</p>
</def>
</def-item>
<def-item>
<term>MERS-CoV</term>
<def>
<p id="Par12">Middle East respiratory syndrome coronavirus</p>
</def>
</def-item>
<def-item>
<term>NGDC</term>
<def>
<p id="Par13">National Genomics Data Center</p>
</def>
</def-item>
<def-item>
<term>PC-PseAAC</term>
<def>
<p id="Par14">Parallel correlation-based pseudo-amino-acid composition</p>
</def>
</def-item>
<def-item>
<term>RF</term>
<def>
<p id="Par15">Random forest</p>
</def>
</def-item>
<def-item>
<term>ROC</term>
<def>
<p id="Par16">Receiver operating characteristic</p>
</def>
</def-item>
<def-item>
<term>SARS-CoV</term>
<def>
<p id="Par17">Severe acute respiratory syndrome coronavirus</p>
</def>
</def-item>
<def-item>
<term>SARS-CoV-2</term>
<def>
<p id="Par18">Severe acute respiratory syndrome coronavirus 2</p>
</def>
</def-item>
<def-item>
<term>SN</term>
<def>
<p id="Par19">Sensitivity</p>
</def>
</def-item>
<def-item>
<term>SP</term>
<def>
<p id="Par20">Specificity</p>
</def>
</def-item>
<def-item>
<term>TPR</term>
<def>
<p id="Par21">True positive rate</p>
</def>
</def-item>
</def-list>
</glossary>
<sec>
<title>Supplementary information</title>
<p>
<bold>Supplementary information</bold>
accompanies this paper at 10.1186/s40249-020-00649-8.</p>
</sec>
<ack>
<title>Acknowledgements</title>
<p>We would like to acknowledge the originating and submitting laboratories of the viral sequences from the NGDC’s 2019nCoVR database. We thank Dr. Maxine Garcia for editing the English text of this manuscript.</p>
</ack>
<notes notes-type="author-contribution">
<title>Authors’ contributions</title>
<p>XQ and ZK designed the framework of analysis. XQ and PX performed all computational work. GF and WL implemented the code and software. ZK wrote the manuscript. All authors read and approved the final manuscript.</p>
</notes>
<notes notes-type="funding-information">
<title>Funding</title>
<p>This work was supported by the National Natural Science Foundation of China (61972109, 61632002) and the Natural Science Foundation of Guangdong Province of China (2018A030313380).</p>
</notes>
<notes notes-type="data-availability">
<title>Availability of data and materials</title>
<p>The protein sequences of 2666 coronaviruses analyzed during the current study are available in the NGDC’s 2019nCoVR Database,
<ext-link ext-link-type="uri" xlink:href="https://bigd.big.ac.cn/ncov">https://bigd.big.ac.cn/ncov</ext-link>
[
<xref ref-type="bibr" rid="CR12">12</xref>
]. The nomenclature for coronavirus in the dataset is provided as Additional file
<xref rid="MOESM1" ref-type="media">1</xref>
. The clustering details for the MDS method are provided as Additional file 
<xref rid="MOESM2" ref-type="media">2</xref>
.</p>
</notes>
<notes>
<title>Ethics approval and consent to participate</title>
<p id="Par43">Not applicable.</p>
</notes>
<notes>
<title>Consent for publication</title>
<p id="Par44">Not applicable.</p>
</notes>
<notes notes-type="COI-statement">
<title>Competing interests</title>
<p id="Par45">The authors declare that they have no competing interests.</p>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gorbalenya</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Enjuanes</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Ziebuhr</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Snijder</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Nidovirales: evolving the largest RNA virus genome</article-title>
<source>Virus Res</source>
<year>2006</year>
<volume>117</volume>
<issue>1</issue>
<fpage>17</fpage>
<lpage>37</lpage>
<pub-id pub-id-type="pmid">16503362</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Corman</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Muth</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Niemeyer</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Drosten</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Hosts and sources of endemic human coronaviruses</article-title>
<source>Adv Virus Res</source>
<year>2018</year>
<volume>100</volume>
<fpage>163</fpage>
<lpage>188</lpage>
<pub-id pub-id-type="pmid">29551135</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cui</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>ZL</given-names>
</name>
</person-group>
<article-title>Origin and evolution of pathogenic coronaviruses</article-title>
<source>Nat Rev Microbiol</source>
<year>2019</year>
<volume>17</volume>
<issue>3</issue>
<fpage>181</fpage>
<lpage>192</lpage>
<pub-id pub-id-type="pmid">30531947</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<mixed-citation publication-type="other">Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, et al. A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med. 2020. 10.1056/NEJMoa2001017.</mixed-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<mixed-citation publication-type="other">Wu F, Zhao S, Yu B, Chen Y, Wang W, Song Z, et al. A new coronavirus associated with human respiratory disease in China. Nature. 2020. 10.1038/s41586-020-2008-3.</mixed-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<mixed-citation publication-type="other">Zhou P, Yang X, Wang X, Hu B, Zhang L, Zhang W, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020. 10.1038/s41586-020-2012-7.</mixed-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Adams</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Carstens</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Ratification vote on taxonomic proposals to the International Committee on Taxonomy of Viruses</article-title>
<source>Arch Virol</source>
<year>2012</year>
<volume>157</volume>
<issue>7</issue>
<fpage>1411</fpage>
<lpage>1422</lpage>
<pub-id pub-id-type="pmid">22481600</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Menachery</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Yount</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Debbink</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Agnihothram</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Gralinski</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Plante</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence</article-title>
<source>Nat Med</source>
<year>2015</year>
<volume>21</volume>
<fpage>1508</fpage>
<lpage>1513</lpage>
<pub-id pub-id-type="pmid">26552008</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qiang</surname>
<given-names>XL</given-names>
</name>
<name>
<surname>Kou</surname>
<given-names>Z</given-names>
</name>
</person-group>
<article-title>Scoring amino acid mutation to predict pandemic risk of avian influenza virus</article-title>
<source>BMC Bioinformatics</source>
<year>2019</year>
<volume>20</volume>
<issue>S8</issue>
<fpage>288</fpage>
<pub-id pub-id-type="pmid">31182019</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Qiang</surname>
<given-names>XL</given-names>
</name>
<name>
<surname>Kou</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Scoring amino acid mutations to predict avian-to-human transmission of avian influenza viruses</article-title>
<source>Molecules</source>
<year>2018</year>
<volume>23</volume>
<issue>7</issue>
<fpage>1584</fpage>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Heald-Sargent</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Gallagher</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Ready, set, fuse! The coronavirus spike protein and acquisition of fusion competence</article-title>
<source>Viruses</source>
<year>2012</year>
<volume>4</volume>
<issue>4</issue>
<fpage>557</fpage>
<lpage>580</lpage>
<pub-id pub-id-type="pmid">22590686</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>WM</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>SH</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>Zou</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>LN</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>YK</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The 2019 novel coronavirus resource</article-title>
<source>Yi Chuan</source>
<year>2020</year>
<volume>42</volume>
<issue>2</issue>
<fpage>212</fpage>
<lpage>221</lpage>
<pub-id pub-id-type="pmid">32102777</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>K</given-names>
</name>
</person-group>
<article-title>Pse-in-one: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences</article-title>
<source>Nucleic Acids Res</source>
<year>2015</year>
<volume>43</volume>
<issue>W1</issue>
<fpage>W65</fpage>
<lpage>W71</lpage>
<pub-id pub-id-type="pmid">25958395</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Atchley</surname>
<given-names>WR</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Fernandes</surname>
<given-names>AD</given-names>
</name>
<name>
<surname>Drüke</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Solving the protein sequence metric problem</article-title>
<source>Proc Natl Acad Sci U S A</source>
<year>2005</year>
<volume>102</volume>
<fpage>6395</fpage>
<lpage>6400</lpage>
<pub-id pub-id-type="pmid">15851683</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liaw</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Wiener</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Classification and regression by randomForest</article-title>
<source>R News</source>
<year>2002</year>
<volume>2</volume>
<fpage>18</fpage>
<lpage>22</lpage>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sing</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Sander</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Beerenwinkel</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Lengauer</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>ROCR: visualizing classifier performance in R</article-title>
<source>Bioinformatics</source>
<year>2005</year>
<volume>21</volume>
<fpage>3940</fpage>
<lpage>3941</lpage>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<mixed-citation publication-type="other">Guan WJ, Ni ZY, Hu Y, Liang WH, Ou CQ, He JX, et al. Clinical characteristics of coronavirus disease 2019 in China. N Engl J Med. 2020. 10.1056/NEJMoa2002032.</mixed-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Holshue</surname>
<given-names>ML</given-names>
</name>
<name>
<surname>DeBolt</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Lindquist</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Lofy</surname>
<given-names>KH</given-names>
</name>
<name>
<surname>Wiesman</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Bruce</surname>
<given-names>H</given-names>
</name>
<etal></etal>
</person-group>
<article-title>First case of 2019 novel coronavirus in the United States</article-title>
<source>N Engl J Med</source>
<year>2020</year>
<volume>382</volume>
<fpage>929</fpage>
<lpage>936</lpage>
<pub-id pub-id-type="pmid">32004427</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<mixed-citation publication-type="other">Letko M, Marzi A, Munster V. Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses. Nat Microbiol. 2020. 10.1038/s41564-020-0688-y.</mixed-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<mixed-citation publication-type="other">Wrapp D, Wang N, Corbett K, Goldsmith J, Hsieh C, Abiona O, et al. Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science. 2020. 10.1126/science.abb2507.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/SrasV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000391 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000391 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    SrasV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:7093988
   |texte=   Using the spike protein feature to predict infection risk and monitor the evolutionary dynamic of coronavirus
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:32209118" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a SrasV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Tue Apr 28 14:49:16 2020. Site generation: Sat Mar 27 22:06:49 2021