MERS exploration server

Recombination spot identification Based on gapped k-mers

Authors: Rong Wang; Yong Xu; Bin Liu

Source: Scientific Reports

RBID : PMC:4814916

Abstract

Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity. Accurate identification of recombination spots is useful for studying DNA function. To improve prediction accuracy, researchers have proposed several computational methods for recombination spot identification. The k-mer feature is one of the most useful features for modeling the properties and function of DNA sequences. However, it suffers from an inherent limitation: if the word length k is large, the occurrence count of each k-mer is close to a binary variable, with a few k-mers present once and most k-mers absent. This usually causes a sparsity problem and reduces classification accuracy. To solve this problem, we add gaps into the k-mer and introduce a new feature called the gapped k-mer (GKM) for identification of recombination spots. Using this feature, we present a new predictor called SVM-GKM, which combines gapped k-mers and a Support Vector Machine (SVM) for recombination spot identification. Experimental results on a widely used benchmark dataset show that SVM-GKM outperforms other highly related predictors. Therefore, SVM-GKM would be a powerful predictor for computational genomics.


Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4814916
DOI: 10.1038/srep23934
PubMed: 27030570
PubMed Central: 4814916

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Recombination spot identification Based on gapped k-mers</title>
<author>
<name sortKey="Wang, Rong" sort="Wang, Rong" uniqKey="Wang R" first="Rong" last="Wang">Rong Wang</name>
<affiliation>
<nlm:aff id="a1">
<institution>School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School</institution>
, Shenzhen, Guangdong 518055,
<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xu, Yong" sort="Xu, Yong" uniqKey="Xu Y" first="Yong" last="Xu">Yong Xu</name>
<affiliation>
<nlm:aff id="a1">
<institution>School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School</institution>
, Shenzhen, Guangdong 518055,
<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Bin" sort="Liu, Bin" uniqKey="Liu B" first="Bin" last="Liu">Bin Liu</name>
<affiliation>
<nlm:aff id="a1">
<institution>School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School</institution>
, Shenzhen, Guangdong 518055,
<country>China</country>
</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">27030570</idno>
<idno type="pmc">4814916</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4814916</idno>
<idno type="RBID">PMC:4814916</idno>
<idno type="doi">10.1038/srep23934</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000144</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000144</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Recombination spot identification Based on gapped k-mers</title>
<author>
<name sortKey="Wang, Rong" sort="Wang, Rong" uniqKey="Wang R" first="Rong" last="Wang">Rong Wang</name>
<affiliation>
<nlm:aff id="a1">
<institution>School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School</institution>
, Shenzhen, Guangdong 518055,
<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Xu, Yong" sort="Xu, Yong" uniqKey="Xu Y" first="Yong" last="Xu">Yong Xu</name>
<affiliation>
<nlm:aff id="a1">
<institution>School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School</institution>
, Shenzhen, Guangdong 518055,
<country>China</country>
</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Liu, Bin" sort="Liu, Bin" uniqKey="Liu B" first="Bin" last="Liu">Bin Liu</name>
<affiliation>
<nlm:aff id="a1">
<institution>School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School</institution>
, Shenzhen, Guangdong 518055,
<country>China</country>
</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Scientific Reports</title>
<idno type="eISSN">2045-2322</idno>
<imprint>
<date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity. Accurate identification of recombination spots is useful for studying DNA function. To improve prediction accuracy, researchers have proposed several computational methods for recombination spot identification. The k-mer feature is one of the most useful features for modeling the properties and function of DNA sequences. However, it suffers from an inherent limitation. If the word length
<italic>k</italic>
is large, the occurrence count of each k-mer is close to a binary variable, with a few k-mers present once and most k-mers absent. This usually causes a sparsity problem and reduces classification accuracy. To solve this problem, we add gaps into the k-mer and introduce a new feature called the gapped k-mer (GKM) for identification of recombination spots. Using this feature, we present a new predictor called SVM-GKM, which combines gapped k-mers and a Support Vector Machine (SVM) for recombination spot identification. Experimental results on a widely used benchmark dataset show that SVM-GKM outperforms other highly related predictors. Therefore, SVM-GKM would be a powerful predictor for computational genomics.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, W" uniqKey="Chen W">W. Chen</name>
</author>
<author>
<name sortKey="Feng, P" uniqKey="Feng P">P. Feng</name>
</author>
<author>
<name sortKey="Lin, H" uniqKey="Lin H">H. Lin</name>
</author>
<author>
<name sortKey="Chou, K" uniqKey="Chou K">K. Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Arnheim, N" uniqKey="Arnheim N">N. Arnheim</name>
</author>
<author>
<name sortKey="Calabrese, P" uniqKey="Calabrese P">P. Calabrese</name>
</author>
<author>
<name sortKey="Tiemann Boege, I" uniqKey="Tiemann Boege I">I. Tiemann-Boege</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
<author>
<name sortKey="Tian, Y" uniqKey="Tian Y">Y. Tian</name>
</author>
<author>
<name sortKey="Cheng, R" uniqKey="Cheng R">R. Cheng</name>
</author>
<author>
<name sortKey="Jin, Y" uniqKey="Jin Y">Y. Jin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
<author>
<name sortKey="Tian, Y" uniqKey="Tian Y">Y. Tian</name>
</author>
<author>
<name sortKey="Jin, Y" uniqKey="Jin Y">Y. Jin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, L" uniqKey="Li L">L. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, L" uniqKey="Wei L">L. Wei</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Weyn, B" uniqKey="Weyn B">B. Weyn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zou, Q" uniqKey="Zou Q">Q. Zou</name>
</author>
<author>
<name sortKey="Chen, W" uniqKey="Chen W">W. Chen</name>
</author>
<author>
<name sortKey="Huang, Y" uniqKey="Huang Y">Y. Huang</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X. Liu</name>
</author>
<author>
<name sortKey="Jiang, Y" uniqKey="Jiang Y">Y. Jiang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, J" uniqKey="Peng J">J. Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cheng, X Y" uniqKey="Cheng X">X.-Y. Cheng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zeng, X" uniqKey="Zeng X">X. Zeng</name>
</author>
<author>
<name sortKey="Xu, L" uniqKey="Xu L">L. Xu</name>
</author>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X. Liu</name>
</author>
<author>
<name sortKey="Pan, L" uniqKey="Pan L">L. Pan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lin, C" uniqKey="Lin C">C. Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zou, Q" uniqKey="Zou Q">Q. Zou</name>
</author>
<author>
<name sortKey="Li, X" uniqKey="Li X">X. Li</name>
</author>
<author>
<name sortKey="Jiang, Y" uniqKey="Jiang Y">Y. Jiang</name>
</author>
<author>
<name sortKey="Zhao, Y" uniqKey="Zhao Y">Y. Zhao</name>
</author>
<author>
<name sortKey="Wang, G" uniqKey="Wang G">G. Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zou, Q" uniqKey="Zou Q">Q. Zou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zou, Q" uniqKey="Zou Q">Q. Zou</name>
</author>
<author>
<name sortKey="Zeng, J" uniqKey="Zeng J">J. Zeng</name>
</author>
<author>
<name sortKey="Cao, L" uniqKey="Cao L">L. Cao</name>
</author>
<author>
<name sortKey="Ji, R" uniqKey="Ji R">R. Ji</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gerton, J L" uniqKey="Gerton J">J. L. Gerton</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, G" uniqKey="Liu G">G. Liu</name>
</author>
<author>
<name sortKey="Jia, L" uniqKey="Jia L">L. Jia</name>
</author>
<author>
<name sortKey="Cui, X" uniqKey="Cui X">X. Cui</name>
</author>
<author>
<name sortKey="Lu, C" uniqKey="Lu C">C. Lu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nanni, L" uniqKey="Nanni L">L. Nanni</name>
</author>
<author>
<name sortKey="Lumini, A" uniqKey="Lumini A">A. Lumini</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sahu, S S" uniqKey="Sahu S">S. S. Sahu</name>
</author>
<author>
<name sortKey="Panda, G" uniqKey="Panda G">G. Panda</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nanni, L" uniqKey="Nanni L">L. Nanni</name>
</author>
<author>
<name sortKey="Lumini, A" uniqKey="Lumini A">A. Lumini</name>
</author>
<author>
<name sortKey="Gupta, D" uniqKey="Gupta D">D. Gupta</name>
</author>
<author>
<name sortKey="Garg, A" uniqKey="Garg A">A. Garg</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chou, K" uniqKey="Chou K">K. Chou</name>
</author>
<author>
<name sortKey="Com, M P" uniqKey="Com M">M. P. Com</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Getun, I V" uniqKey="Getun I">I. V. Getun</name>
</author>
<author>
<name sortKey="Wu, Z K" uniqKey="Wu Z">Z. K. Wu</name>
</author>
<author>
<name sortKey="Khalil, A M" uniqKey="Khalil A">A. M. Khalil</name>
</author>
<author>
<name sortKey="Bois, P R J" uniqKey="Bois P">P. R. J. Bois</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Nasar, F" uniqKey="Nasar F">F. Nasar</name>
</author>
<author>
<name sortKey="Jankowski, C" uniqKey="Jankowski C">C. Jankowski</name>
</author>
<author>
<name sortKey="Nag, D K" uniqKey="Nag D">D. K. Nag</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, L" uniqKey="Wei L">L. Wei</name>
</author>
<author>
<name sortKey="Liao, M" uniqKey="Liao M">M. Liao</name>
</author>
<author>
<name sortKey="Gao, X" uniqKey="Gao X">X. Gao</name>
</author>
<author>
<name sortKey="Zou, Q" uniqKey="Zou Q">Q. Zou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Meunier, J" uniqKey="Meunier J">J. Meunier</name>
</author>
<author>
<name sortKey="Duret, L" uniqKey="Duret L">L. Duret</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, G" uniqKey="Liu G">G. Liu</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H. Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Myers, S" uniqKey="Myers S">S. Myers</name>
</author>
<author>
<name sortKey="Freeman, C" uniqKey="Freeman C">C. Freeman</name>
</author>
<author>
<name sortKey="Auton, A" uniqKey="Auton A">A. Auton</name>
</author>
<author>
<name sortKey="Donnelly, P" uniqKey="Donnelly P">P. Donnelly</name>
</author>
<author>
<name sortKey="Mcvean, G" uniqKey="Mcvean G">G. Mcvean</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Christopher, F B" uniqKey="Christopher F">F. B. Christopher</name>
</author>
<author>
<name sortKey="Dongwon, L" uniqKey="Dongwon L">L. Dongwon</name>
</author>
<author>
<name sortKey="Mccallion, A S" uniqKey="Mccallion A">A. S. Mccallion</name>
</author>
<author>
<name sortKey="Beer, M" uniqKey="Beer M">M. Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ghandi, M" uniqKey="Ghandi M">M. Ghandi</name>
</author>
<author>
<name sortKey="Mohammad Noori, M" uniqKey="Mohammad Noori M">M. Mohammad-Noori</name>
</author>
<author>
<name sortKey="Beer, M A" uniqKey="Beer M">M. A. Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D. Lee</name>
</author>
<author>
<name sortKey="Karchin, R" uniqKey="Karchin R">R. Karchin</name>
</author>
<author>
<name sortKey="Beer, M A" uniqKey="Beer M">M. A. Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ghandi, M" uniqKey="Ghandi M">M. Ghandi</name>
</author>
<author>
<name sortKey="Lee, D" uniqKey="Lee D">D. Lee</name>
</author>
<author>
<name sortKey="Mohammad Noori, M" uniqKey="Mohammad Noori M">M. Mohammad-Noori</name>
</author>
<author>
<name sortKey="Beer, M A" uniqKey="Beer M">M. A. Beer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
<author>
<name sortKey="Fang, L" uniqKey="Fang L">L. Fang</name>
</author>
<author>
<name sortKey="Jie, C" uniqKey="Jie C">C. Jie</name>
</author>
<author>
<name sortKey="Liu, F" uniqKey="Liu F">F. Liu</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X. Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Quek, L E" uniqKey="Quek L">L. E. Quek</name>
</author>
<author>
<name sortKey="Nielsen, L K" uniqKey="Nielsen L">L. K. Nielsen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhu, T" uniqKey="Zhu T">T. Zhu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leslie, C S" uniqKey="Leslie C">C. S. Leslie</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E. Eskin</name>
</author>
<author>
<name sortKey="Cohen, A" uniqKey="Cohen A">A. Cohen</name>
</author>
<author>
<name sortKey="Weston, J" uniqKey="Weston J">J. Weston</name>
</author>
<author>
<name sortKey="Noble, W S" uniqKey="Noble W">W. S. Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zeng, X" uniqKey="Zeng X">X. Zeng</name>
</author>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
<author>
<name sortKey="Zou, Q" uniqKey="Zou Q">Q. Zou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, W" uniqKey="Chen W">W. Chen</name>
</author>
<author>
<name sortKey="Feng, P" uniqKey="Feng P">P. Feng</name>
</author>
<author>
<name sortKey="Lin, H" uniqKey="Lin H">H. Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, W" uniqKey="Chen W">W. Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, W" uniqKey="Chen W">W. Chen</name>
</author>
<author>
<name sortKey="Feng, P M" uniqKey="Feng P">P.-M. Feng</name>
</author>
<author>
<name sortKey="Lin, H" uniqKey="Lin H">H. Lin</name>
</author>
<author>
<name sortKey="Chou, K C" uniqKey="Chou K">K.-C. Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Manoj, B" uniqKey="Manoj B">B. Manoj</name>
</author>
<author>
<name sortKey="Raghava, G P S" uniqKey="Raghava G">G. P. S. Raghava</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hua, S" uniqKey="Hua S">S. Hua</name>
</author>
<author>
<name sortKey="Sun, Z" uniqKey="Sun Z">Z. Sun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bhasin, M" uniqKey="Bhasin M">M. Bhasin</name>
</author>
<author>
<name sortKey="Reinherz, E L" uniqKey="Reinherz E">E. L. Reinherz</name>
</author>
<author>
<name sortKey="Reche, P A" uniqKey="Reche P">P. A. Reche</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Leslie, C" uniqKey="Leslie C">C. Leslie</name>
</author>
<author>
<name sortKey="Eskin, E" uniqKey="Eskin E">E. Eskin</name>
</author>
<author>
<name sortKey="Noble, W S" uniqKey="Noble W">W. S. Noble</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J. Chen</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X. Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
<author>
<name sortKey="Fang, L" uniqKey="Fang L">L. Fang</name>
</author>
<author>
<name sortKey="Long, R" uniqKey="Long R">R. Long</name>
</author>
<author>
<name sortKey="Lan, X" uniqKey="Lan X">X. Lan</name>
</author>
<author>
<name sortKey="Chou, K C" uniqKey="Chou K">K.-C. Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J. Chen</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X. Wang</name>
</author>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, S" uniqKey="Yang S">S. Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, S" uniqKey="Yang S">S. Yang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, C" uniqKey="Wei C">C. Wei</name>
</author>
<author>
<name sortKey="Peng Mian, F" uniqKey="Peng Mian F">F. Peng-Mian</name>
</author>
<author>
<name sortKey="Hao, L" uniqKey="Hao L">L. Hao</name>
</author>
<author>
<name sortKey="Kuo Chen, C" uniqKey="Kuo Chen C">C. Kuo-Chen</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, S" uniqKey="Chen S">S. Chen</name>
</author>
<author>
<name sortKey="Zhu, Y" uniqKey="Zhu Y">Y. Zhu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smith, L I" uniqKey="Smith L">L. I. Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
<author>
<name sortKey="Chen, J" uniqKey="Chen J">J. Chen</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X. Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Steiner, W W" uniqKey="Steiner W">W. W. Steiner</name>
</author>
<author>
<name sortKey="Steiner, E M" uniqKey="Steiner E">E. M. Steiner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Getun, I V" uniqKey="Getun I">I. V. Getun</name>
</author>
<author>
<name sortKey="Wu, Z K" uniqKey="Wu Z">Z. K. Wu</name>
</author>
<author>
<name sortKey="Bois, P R J" uniqKey="Bois P">P. R. J. Bois</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, B" uniqKey="Liu B">B. Liu</name>
</author>
<author>
<name sortKey="Fang, L" uniqKey="Fang L">L. Fang</name>
</author>
<author>
<name sortKey="Liu, F" uniqKey="Liu F">F. Liu</name>
</author>
<author>
<name sortKey="Wang, X" uniqKey="Wang X">X. Wang</name>
</author>
<author>
<name sortKey="Chou, K C" uniqKey="Chou K">K.-C. Chou</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
<author>
<name sortKey="Pan, L" uniqKey="Pan L">L. Pan</name>
</author>
<author>
<name sortKey="P Un, A" uniqKey="P Un A">A. Păun</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, T" uniqKey="Song T">T. Song</name>
</author>
<author>
<name sortKey="Pan, L" uniqKey="Pan L">L. Pan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, X" uniqKey="Zhang X">X. Zhang</name>
</author>
<author>
<name sortKey="Zeng, X" uniqKey="Zeng X">X. Zeng</name>
</author>
<author>
<name sortKey="Luo, B" uniqKey="Luo B">B. Luo</name>
</author>
<author>
<name sortKey="Pan, L" uniqKey="Pan L">L. Pan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Song, T" uniqKey="Song T">T. Song</name>
</author>
<author>
<name sortKey="Pan, L" uniqKey="Pan L">L. Pan</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sci Rep</journal-id>
<journal-id journal-id-type="iso-abbrev">Sci Rep</journal-id>
<journal-title-group>
<journal-title>Scientific Reports</journal-title>
</journal-title-group>
<issn pub-type="epub">2045-2322</issn>
<publisher>
<publisher-name>Nature Publishing Group</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">27030570</article-id>
<article-id pub-id-type="pmc">4814916</article-id>
<article-id pub-id-type="pii">srep23934</article-id>
<article-id pub-id-type="doi">10.1038/srep23934</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Recombination spot identification Based on gapped k-mers</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Rong</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Xu</surname>
<given-names>Yong</given-names>
</name>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Bin</given-names>
</name>
<xref ref-type="corresp" rid="c1">a</xref>
<xref ref-type="aff" rid="a1">1</xref>
</contrib>
<aff id="a1">
<label>1</label>
<institution>School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School</institution>
, Shenzhen, Guangdong 518055,
<country>China</country>
</aff>
</contrib-group>
<author-notes>
<corresp id="c1">
<label>a</label>
<email>bliu@insun.hit.edu.cn</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>31</day>
<month>03</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection">
<year>2016</year>
</pub-date>
<volume>6</volume>
<elocation-id>23934</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>09</month>
<year>2015</year>
</date>
<date date-type="accepted">
<day>16</day>
<month>03</month>
<year>2016</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright © 2016, Macmillan Publishers Limited</copyright-statement>
<copyright-year>2016</copyright-year>
<copyright-holder>Macmillan Publishers Limited</copyright-holder>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<pmc-comment>author-paid</pmc-comment>
<license-p>This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
</license-p>
</license>
</permissions>
<abstract>
<p>Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity. Accurate identification of recombination spots is useful for studying DNA function. To improve prediction accuracy, researchers have proposed several computational methods for recombination spot identification. The k-mer feature is one of the most useful features for modeling the properties and function of DNA sequences. However, it suffers from an inherent limitation. If the word length
<italic>k</italic>
is large, the occurrence count of each k-mer is close to a binary variable, with a few k-mers present once and most k-mers absent. This usually causes a sparsity problem and reduces classification accuracy. To solve this problem, we add gaps into the k-mer and introduce a new feature called the gapped k-mer (GKM) for identification of recombination spots. Using this feature, we present a new predictor called SVM-GKM, which combines gapped k-mers and a Support Vector Machine (SVM) for recombination spot identification. Experimental results on a widely used benchmark dataset show that SVM-GKM outperforms other highly related predictors. Therefore, SVM-GKM would be a powerful predictor for computational genomics.</p>
</abstract>
</article-meta>
</front>
<body>
<p>Recombination, the exchange of genetic information in each generation of diploid organisms, plays an important role in genetic evolution
<xref ref-type="bibr" rid="b1">1</xref>
. The exchanged genetic information originates from homologous chromosomes. Recombination therefore provides many new combinations of genetic variations and is an important source of biodiversity
<xref ref-type="bibr" rid="b2">2</xref>
<xref ref-type="bibr" rid="b3">3</xref>
<xref ref-type="bibr" rid="b4">4</xref>
, which can accelerate biological evolution.</p>
<p>To improve predictive accuracy, researchers have proposed several computational methods for recombination spot identification based on well-known machine learning techniques, such as the support vector machine (SVM)
<xref ref-type="bibr" rid="b5">5</xref>
<xref ref-type="bibr" rid="b6">6</xref>
, K-nearest neighbor (KNN)
<xref ref-type="bibr" rid="b7">7</xref>
<xref ref-type="bibr" rid="b8">8</xref>
, Random Forest (RF)
<xref ref-type="bibr" rid="b9">9</xref>
<xref ref-type="bibr" rid="b10">10</xref>
, ensemble classifiers
<xref ref-type="bibr" rid="b11">11</xref>
<xref ref-type="bibr" rid="b12">12</xref>
<xref ref-type="bibr" rid="b13">13</xref>
<xref ref-type="bibr" rid="b14">14</xref>
, ranking
<xref ref-type="bibr" rid="b15">15</xref>
, etc. These methods employ various features. The first computational predictor for recombination identification was based on sequence-dependent frequencies
<xref ref-type="bibr" rid="b16">16</xref>
. Liu
<italic>et al.</italic>
<xref ref-type="bibr" rid="b17">17</xref>
used quadratic discriminant analysis to predict hot or cold spots. However, these methods consider only local sequence composition and ignore all long-range or global sequence-order effects. To overcome this disadvantage, Li
<italic>et al.</italic>
<xref ref-type="bibr" rid="b5">5</xref>
proposed a novel method based on nucleic acid composition (NAC), n-tier NAC and pseudo nucleic acid composition (PseNAC). Following this study, researchers have proposed various predictors
<xref ref-type="bibr" rid="b18">18</xref>
<xref ref-type="bibr" rid="b19">19</xref>
<xref ref-type="bibr" rid="b20">20</xref>
<xref ref-type="bibr" rid="b21">21</xref>
. It has been shown that recombination depends not only on the primary DNA sequence but also on chromatin structure. Getun
<italic>et al.</italic>
<xref ref-type="bibr" rid="b22">22</xref>
used nucleosome occupancy to identify mouse recombination hotspots. Other sequence features also influence recombination; representative examples include palindrome structure
<xref ref-type="bibr" rid="b23">23</xref>
<xref ref-type="bibr" rid="b24">24</xref>
, relatively high GC content
<xref ref-type="bibr" rid="b25">25</xref>
, dinucleotide bias
<xref ref-type="bibr" rid="b26">26</xref>
, repeats, consensus DNA motifs
<xref ref-type="bibr" rid="b27">27</xref>
, etc. Therefore, some computational predictors employ these features, and achieve better performance.</p>
<p>All these computational methods yield encouraging results, and each has helped stimulate the development of recombination spot identification. However, further study is needed for the following reason. Among the aforementioned features, the k-mer
<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b28">28</xref>
<xref ref-type="bibr" rid="b29">29</xref>
<xref ref-type="bibr" rid="b30">30</xref>
<xref ref-type="bibr" rid="b31">31</xref>
<xref ref-type="bibr" rid="b32">32</xref>
is one of the simplest and most widely used features in this field. A k-mer is a nucleotide fragment of
<italic>k</italic>
neighboring residues. By using this feature, the local sequence composition information can be extracted. Typically, the value of
<italic>k</italic>
is set to 6 or 7, and the corresponding feature vector has length 4
<sup>6</sup>
 = 4096 or 4
<sup>7</sup>
 = 16384. Actually, larger
<italic>k</italic>
values are preferred, because more sequence composition information can be incorporated. However, large
<italic>k</italic>
values (
<italic>k</italic>
 > 6) will lead to extremely sparse feature vectors, which may cause a severe over-fitting problem. To balance feature-space sparsity against richer sequence composition information, the gapped k-mer was proposed and successfully applied to enhancer identification
<xref ref-type="bibr" rid="b33">33</xref>
<xref ref-type="bibr" rid="b34">34</xref>
. The gapped k-mer allows several gaps to exist within a k-mer. It can therefore not only significantly reduce the length of the resulting feature vectors but also take the evolutionary process into consideration. Evolution involves changes of single residues, insertions and deletions of several residues, gene doubling and gene fusion. As these changes accumulate over a long period, many similarities between the initial and resultant DNA sequences are gradually eliminated, but the sequences may still share many common features. GKM is able to account for these changes in DNA sequences by using gaps.</p>
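<p>The sparsity of the plain k-mer feature can be illustrated with a minimal Python sketch (not the authors' implementation). It counts overlapping k-mers and builds the count vector of length 4^k. The function names kmer_counts, kmer_vector and nonzero_fraction are hypothetical, and the sketch assumes that sequences contain only A, C, G and T. Because a sequence of length n has at most n - k + 1 windows, at most that many of the 4^k entries can be non-zero.</p>

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: no ambiguous bases such as N


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every contiguous k-mer in seq using overlapping windows."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_vector(seq: str, k: int) -> list:
    """Dense count vector of length 4**k, in lexicographic k-mer order."""
    counts = kmer_counts(seq, k)
    # Counter returns 0 for k-mers that never occur in seq.
    return [counts["".join(p)] for p in product(ALPHABET, repeat=k)]


def nonzero_fraction(seq: str, k: int) -> float:
    """Fraction of the 4**k possible k-mers that occur at least once."""
    return len(kmer_counts(seq, k)) / 4 ** k


if __name__ == "__main__":
    toy = "ACGTACGT"  # toy sequence, not taken from the benchmark
    print(kmer_vector(toy, 2))
    print(nonzero_fraction(toy, 2))
```

<p>For instance, a 500-nt sequence (an illustrative length, not taken from the benchmark) has 495 windows for k = 6, so no more than 495 of the 4096 entries, about 12%, can be non-zero.</p>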
<p>In this study, we apply the gapped k-mer to recombination spot identification, and propose a new computational predictor called SVM-GKM by combining GKM with Support Vector Machines. Experimental results on a widely used benchmark dataset show that SVM-GKM outperforms the two state-of-the-art methods in the field of recombination spot identification, and some interesting patterns can be discovered by analyzing the discriminative features in SVM-GKM.</p>
<sec disp-level="1">
<title>Materials and Methods</title>
<sec disp-level="2">
<title>Benchmark Dataset</title>
<p>Here, we employ a benchmark dataset taken from Liu
<italic>et al.</italic>
<xref ref-type="bibr" rid="b17">17</xref>
to evaluate the performance of various predictors for recombination spot identification. This benchmark dataset contains a recombination hotspot subset and a recombination coldspot subset, which can be defined as</p>
<p>
<disp-formula id="eq1">
<inline-graphic id="d33e200" xlink:href="srep23934-m1.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where positive subset ∑
<sup>+</sup>
contains recombination hotspots, negative subset ∑
<sup>−</sup>
contains recombination coldspots, and the symbol ∪ denotes the union in set theory. There are 490 hotspots in ∑
<sup>+</sup>
and 591 coldspots in ∑
<sup>−</sup>
. The codes of the 1081 DNA samples as well as their detailed sequences are given in the
<xref ref-type="supplementary-material" rid="S1">Supplementary S1</xref>
.</p>
</sec>
<sec disp-level="2">
<title>Gapped k-mer</title>
<p>As the word length
<italic>k</italic>
increases, the k-mer approach runs into a sparsity problem. This is because many k-mers do not appear in a given DNA sequence, and thus its feature vector may contain a large number of zeros. To overcome this disadvantage of k-mers, Ghandi
<italic>et al.</italic>
<xref ref-type="bibr" rid="b33">33</xref>
proposed a new feature, the gapped k-mer (GKM), which uses k-mers with gaps. Their experimental results show that this feature clearly improves the performance of enhancer identification. Motivated by this success, in this study we apply GKM to recombination hotspot identification and propose a computational predictor called SVM-GKM, which uses the full set of k-mers with gaps as features instead of comparing whole sequence pairs. It treats gaps as mismatches. For most predictors, it is critical to calculate the similarity between two elements in the feature space. The similarity score of two sequences is calculated by the kernel function. Therefore, in this section, we describe how to calculate the kernel function of SVM-GKM.</p>
<p>First, each training sample is represented as a series of k-mers, where k is the length of the subsequence. The key to calculating the GKM kernel matrix is to compute, for each pair of sequences, the number of mismatches between all pairs of their k-mers. Here, we define a variable
<italic>m</italic>
to denote the number of matched positions, so the number of gap positions is
<italic>k</italic>
 − 
<italic>m</italic>
. Then feature vector
<italic>f</italic>
<sup>
<italic> S</italic>
</sup>
of a given sequence
<italic>S</italic>
can be defined as</p>
<p>
<disp-formula id="eq2">
<inline-graphic id="d33e252" xlink:href="srep23934-m2.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where
<inline-formula id="d33e255">
<inline-graphic id="d33e256" xlink:href="srep23934-m3.jpg"></inline-graphic>
</inline-formula>
is the count of the
<italic>i</italic>
<italic>th</italic>
gapped k-mer in the sequence S,
<inline-formula id="d33e264">
<inline-graphic id="d33e265" xlink:href="srep23934-m4.jpg"></inline-graphic>
</inline-formula>
stands for the number of all gapped k-mers, and
<italic>b</italic>
is the alphabet size. For DNA sequence,
<italic>b</italic>
 = 4. Then the kernel function between two sequences
<italic>S</italic>
<sub>1</sub>
and
<italic>S</italic>
<sub>2</sub>
can be defined as</p>
<p>
<disp-formula id="eq5">
<inline-graphic id="d33e286" xlink:href="srep23934-m5.jpg"></inline-graphic>
</disp-formula>
</p>
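<p>As an illustrative sketch only, the feature vector and kernel described above can be computed directly for short sequences. The function names are hypothetical, gaps are written as *, and the cosine normalization is an assumption borrowed from common practice for this kind of kernel; the kernel equation itself is not reproduced here.</p>
<preformat>
```python
from collections import Counter
from itertools import combinations
from math import sqrt


def gapped_kmer_counts(seq, k, m):
    """Count gapped k-mers: windows of length k with m informative positions."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        for positions in combinations(range(k), m):
            # Keep the m chosen positions and mask the other k - m as gaps.
            pattern = "".join(word[p] if p in positions else "*" for p in range(k))
            counts[pattern] += 1
    return counts


def inner_product(c1, c2):
    """Inner product of two sparse count vectors."""
    return sum(v * c2[key] for key, v in c1.items())


def gkm_kernel(s1, s2, k, m, normalize=True):
    """Similarity of two sequences in gapped k-mer feature space."""
    c1 = gapped_kmer_counts(s1, k, m)
    c2 = gapped_kmer_counts(s2, k, m)
    value = inner_product(c1, c2)
    if normalize:
        # Assumed cosine normalization, so identical sequences score 1.
        value /= sqrt(inner_product(c1, c1) * inner_product(c2, c2))
    return value


print(gapped_kmer_counts("ACG", 3, 2))
print(gkm_kernel("ACGTAC", "ACGAAC", 4, 2))
```
</preformat>
<p>This direct enumeration visits every choice of m positions in every window, which is only practical for small k; it is meant to make the definition concrete rather than to be efficient.</p>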
<p>Since the number of all possible gapped k-mers grows extremely rapidly as
<italic>m</italic>
increases, direct calculation of
<xref ref-type="disp-formula" rid="eq9">Eq. 3</xref>
is almost intractable
<xref ref-type="bibr" rid="b33">33</xref>
. Thus, the inner product in
<xref ref-type="disp-formula" rid="eq9">Eq. 3</xref>
is computed by the following equation:</p>
<p>
<disp-formula id="eq6">
<inline-graphic id="d33e302" xlink:href="srep23934-m6.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where
<italic>n</italic>
(
<italic>n</italic>
 ≤ 
<italic>k</italic>
 − 
<italic>m</italic>
) is the number of mismatches between two k-mers
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
.
<italic>x</italic>
<sub>1</sub>
is from
<italic>S</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
is from
<italic>S</italic>
<sub>2</sub>
,
<italic>N</italic>
<sub>
<italic>n</italic>
</sub>
(
<italic>S</italic>
<sub>1</sub>
,
<italic>S</italic>
<sub>2</sub>
) is the number of pairs of k-mers with
<italic>n</italic>
mismatches in sequences
<italic>S</italic>
<sub>1</sub>
and
<italic>S</italic>
<sub>2</sub>
,
<italic>h</italic>
<sub>n</sub>
is the corresponding coefficient.
<italic>h</italic>
<sub>
<italic>n</italic>
</sub>
is defined as follows:</p>
<p>
<disp-formula id="eq7">
<inline-graphic id="d33e393" xlink:href="srep23934-m7.jpg"></inline-graphic>
</disp-formula>
</p>
<p>In order to reduce the error caused by corresponding coefficients, the following equation is used to get
<italic>h</italic>
<sub>
<italic>n</italic>
</sub>
when calculating the mismatch for two sequences</p>
<p>
<disp-formula id="eq8">
<inline-graphic id="d33e405" xlink:href="srep23934-m8.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where
<italic>n</italic>
<sub>
<italic>1</italic>
</sub>
is the mismatch number that k-mer
<italic>x</italic>
<sub>1</sub>
contains,
<italic>n</italic>
<sub>
<italic>2</italic>
</sub>
is the mismatch number that k-mer
<italic>x</italic>
<sub>2</sub>
contains, and
<italic>t</italic>
is the number of mismatches located at the
<italic>k</italic>
 − 
<italic>n</italic>
mismatch positions for both
<italic>x</italic>
<sub>1</sub>
and
<italic>x</italic>
<sub>2</sub>
. The remaining mismatches
<italic>r</italic>
 = 
<italic>n</italic>
<sub>2</sub>
 − 
<italic>t</italic>
 − (
<italic>n</italic>
 − 
<italic>n</italic>
<sub>1</sub>
 − t) are among the
<italic>n</italic>
mismatch positions for k-mer
<italic>x</italic>
<sub>2</sub>
.</p>
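<p>The idea of replacing the explicit feature vectors by a mismatch profile can be sketched as follows. This is a hedged illustration, not the implementation used in the paper: it assumes the unnormalized kernel and the coefficient h_n = C(k − n, m) of the original gapped k-mer formulation, and all function names are hypothetical. The brute-force function kernel_direct is included only to check that both routes agree.</p>
<preformat>
```python
from collections import Counter
from itertools import combinations
from math import comb


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(x1, x2):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(a != b for a, b in zip(x1, x2))


def mismatch_profile(s1, s2, k):
    """profile[n] = number of k-mer pairs (x1 from s1, x2 from s2) with n mismatches."""
    profile = [0] * (k + 1)
    for x1 in kmers(s1, k):
        for x2 in kmers(s2, k):
            profile[hamming(x1, x2)] += 1
    return profile


def kernel_from_profile(s1, s2, k, m):
    """Sum of h_n * N_n for n from 0 to k - m, with the assumed h_n = C(k - n, m)."""
    profile = mismatch_profile(s1, s2, k)
    return sum(comb(k - n, m) * profile[n] for n in range(k - m + 1))


def kernel_direct(s1, s2, k, m):
    """Reference value: explicit inner product of gapped k-mer count vectors."""
    def counts(seq):
        c = Counter()
        for word in kmers(seq, k):
            for pos in combinations(range(k), m):
                c[(pos, tuple(word[p] for p in pos))] += 1
        return c
    c1, c2 = counts(s1), counts(s2)
    return sum(v * c2[key] for key, v in c1.items())


a, b = "ACGTACGGT", "ACGAACGTT"  # hypothetical sequences
print(kernel_from_profile(a, b, 5, 3), kernel_direct(a, b, 5, 3))
```
</preformat>
<p>Two k-mers that differ at n positions agree on k − n positions, so they share exactly C(k − n, m) gapped k-mers; summing this over all pairs gives the same inner product without enumerating the gapped k-mers themselves.</p>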
</sec>
<sec disp-level="2">
<title>Tree structure</title>
<p>In this paper, a tree structure is employed to count mismatches
<xref ref-type="bibr" rid="b33">33</xref>
so as to improve the calculation efficiency of GKM.</p>
<p>The tree is built from the training samples by adding a path for every k-mer. Assume that s(
<italic>t</italic>
<sub>
<italic>i</italic>
</sub>
) stands for the path from the root to node
<italic>t</italic>
<sub>
<italic>i</italic>
</sub>
with depth
<italic>d</italic>
.
<italic>d</italic>
means that the corresponding sub-sequence has a length of
<italic>d</italic>
. For a tree, its maximum depth is
<italic>k</italic>
, i.e. the length of the k-mer. Therefore, each terminal leaf node of the tree represents a k-mer. A terminal leaf node also holds the list of training sequence labels, which records which sequences contain the k-mer and how many times it occurs in each. We use depth-first search (DFS)
<xref ref-type="bibr" rid="b35">35</xref>
<xref ref-type="bibr" rid="b36">36</xref>
order to search the tree and obtain the mismatch profile. Based on the method in
<xref ref-type="bibr" rid="b37">37</xref>
, we store the list of pointers to all nodes
<italic>t</italic>
<sub>
<italic>i</italic>
</sub>
at depth
<italic>d</italic>
and also store the number of mismatches between two paths s(
<italic>t</italic>
<sub>
<italic>i</italic>
</sub>
) and s(
<italic>t</italic>
<sub>
<italic>j</italic>
</sub>
). Differing from this method, our method only needs to store the values of the terminal leaf nodes and does not need to store the information of all nodes. Thus, at the end of one DFS traversal of the tree, the mismatch profiles for all pairs of sequences are completely determined.
<xref ref-type="fig" rid="f1">Figure 1</xref>
gives an example of a mismatch tree with
<italic>k</italic>
 = 3. The tree is generated by sequences
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
,
<italic>S</italic>
<sub>
<italic>2</italic>
</sub>
, and
<italic>S</italic>
<sub>
<italic>3</italic>
</sub>
. We can see that for node
<italic>t</italic>
<sub>6</sub>
, s(
<italic>t</italic>
<sub>
<italic>6</italic>
</sub>
) = ‘AAA’. Sequence
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
contains two counts of substring
<italic>s</italic>
(
<italic>t</italic>
<sub>6</sub>
), but sequence
<italic>S</italic>
<sub>
<italic>2</italic>
</sub>
and sequence
<italic>S</italic>
<sub>
<italic>3</italic>
</sub>
do not contain this substring. For our experiments, we used the gkm-SVM software v1.3
<xref ref-type="bibr" rid="b33">33</xref>
as the implementation of the gapped k-mer and tree structure, which is available at
<ext-link ext-link-type="uri" xlink:href="http://www.beerlab.org/gkmsvm/">http://www.beerlab.org/gkmsvm/</ext-link>
.</p>
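<p>A much simplified sketch of such a tree is given below. It is not the gkm-SVM implementation: the nested-dictionary layout, the key LEAF, the function names, and the example sequences are all hypothetical. Each root-to-leaf path spells one k-mer, each leaf stores per-sequence counts, and a depth-first traversal over pairs of paths accumulates the mismatch profile for every ordered pair of sequences, pruning branches that exceed the allowed number of mismatches.</p>
<preformat>
```python
from collections import defaultdict

LEAF = "counts"  # hypothetical key under which a leaf stores its counts


def build_tree(sequences, k):
    """Add one root-to-leaf path per k-mer; leaves map sequence index to count."""
    root = {}
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            node = root
            for base in seq[i:i + k]:
                node = node.setdefault(base, {})
            leaf = node.setdefault(LEAF, defaultdict(int))
            leaf[idx] += 1
    return root


def mismatch_profiles(root, k, max_mismatch):
    """profiles[(i, j)][n] = number of k-mer pairs from sequences i, j with n mismatches."""
    profiles = defaultdict(lambda: [0] * (max_mismatch + 1))

    def dfs(a, b, depth, mismatches):
        if depth == k:
            # Both paths are complete k-mers: combine their per-sequence counts.
            for i, ci in a[LEAF].items():
                for j, cj in b[LEAF].items():
                    profiles[(i, j)][mismatches] += ci * cj
            return
        for base_a, child_a in a.items():
            for base_b, child_b in b.items():
                new = mismatches + (base_a != base_b)
                if new > max_mismatch:
                    continue  # prune: this pair of paths can never qualify
                dfs(child_a, child_b, depth + 1, new)

    dfs(root, root, 0, 0)
    return profiles


seqs = ["AAAAT", "ACGT", "AATT"]  # hypothetical sequences
tree = build_tree(seqs, 3)
print(dict(tree["A"]["A"]["A"][LEAF]))  # counts of 'AAA' per sequence index
print(mismatch_profiles(tree, 3, 1)[(0, 2)])
```
</preformat>
<p>Because all information needed for the profile sits in the leaves, a single traversal suffices for all sequence pairs; the pruning threshold plays the role of the maximum number of gap positions.</p>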
</sec>
<sec disp-level="2">
<title>Support Vector Machine</title>
<p>The support vector machine (SVM) method is a widely used method for classification problems
<xref ref-type="bibr" rid="b34">34</xref>
<xref ref-type="bibr" rid="b38">38</xref>
<xref ref-type="bibr" rid="b39">39</xref>
<xref ref-type="bibr" rid="b40">40</xref>
<xref ref-type="bibr" rid="b41">41</xref>
<xref ref-type="bibr" rid="b42">42</xref>
, which is based on the structural risk minimization principle from statistical learning theory
<xref ref-type="bibr" rid="b43">43</xref>
<xref ref-type="bibr" rid="b44">44</xref>
<xref ref-type="bibr" rid="b45">45</xref>
<xref ref-type="bibr" rid="b46">46</xref>
. The basic idea of SVM is to construct a separating hyper-plane so as to maximize the margin between positive and negative datasets. SVM first constructs a hyper-plane based on the training dataset. This step uses a kernel function, which implicitly maps the samples into a feature space, to build the discriminant function. The trained model is then applied to the test dataset to obtain the final classification results.</p>
</sec>
<sec disp-level="2">
<title>Cross-Validation</title>
<p>K-fold cross-validation is a widely used method for evaluating the performance of a computational predictor
<xref ref-type="bibr" rid="b47">47</xref>
<xref ref-type="bibr" rid="b48">48</xref>
. In this article, following previous studies
<xref ref-type="bibr" rid="b49">49</xref>
, we use 5-fold cross-validation to evaluate the performance of various predictors. First, we split the dataset, which contains both recombination hotspots and recombination coldspots, into five segments. Then we use four segments of both hotspots and coldspots as the training dataset and the remaining segment as the testing dataset. We repeat this operation until each of the five segments has been used once as the testing dataset. Finally, we report the mean prediction accuracy over the five runs as our final result.</p>
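<p>The procedure can be sketched in a few lines. This is an illustration under stated assumptions: the split is stratified by class and seeded for reproducibility, which the text does not specify, and train_and_score stands for any hypothetical routine that trains on one index list and returns the accuracy on another. A trivial majority-class baseline is used here in place of a real predictor.</p>
<preformat>
```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Split sample indices into n_folds segments, balancing each class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


def cross_validate(labels, train_and_score, n_folds=5, seed=0):
    """Use each segment once as the test set and return the mean score."""
    scores = []
    for held_out in stratified_folds(labels, n_folds, seed):
        test_set = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in test_set]
        scores.append(train_and_score(train_idx, sorted(held_out)))
    return sum(scores) / len(scores)


# 490 hotspots (label 1) and 591 coldspots (label 0), as in the benchmark.
labels = [1] * 490 + [0] * 591


def majority_baseline(train_idx, test_idx):
    """Hypothetical stand-in predictor: always predict the majority training class."""
    ones = sum(labels[i] for i in train_idx)
    guess = 1 if 2 * ones > len(train_idx) else 0
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)


print(cross_validate(labels, majority_baseline))
```
</preformat>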
</sec>
<sec disp-level="2">
<title>Evaluation Method of the Performance</title>
<p>Here, we use four metrics, sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC), to evaluate the predictor
<xref ref-type="bibr" rid="b48">48</xref>
<xref ref-type="bibr" rid="b50">50</xref>
<xref ref-type="bibr" rid="b51">51</xref>
<xref ref-type="bibr" rid="b52">52</xref>
. They are calculated as follows.</p>
<p>
<disp-formula id="eq9">
<inline-graphic id="d33e635" xlink:href="srep23934-m9.jpg"></inline-graphic>
</disp-formula>
</p>
<p>where
<italic>N</italic>
<sup>+</sup>
is the total number of the tested recombination hotspot sequences,
<inline-formula id="d33e643">
<inline-graphic id="d33e644" xlink:href="srep23934-m10.jpg"></inline-graphic>
</inline-formula>
is the number of the tested recombination hotspots which are predicted as recombination coldspots,
<italic>N</italic>
<sup>−</sup>
is the total number of the tested recombination coldspot sequences,
<inline-formula id="d33e651">
<inline-graphic id="d33e652" xlink:href="srep23934-m11.jpg"></inline-graphic>
</inline-formula>
is the number of the tested recombination coldspot sequences which are predicted as recombination hotspots.</p>
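<p>A minimal sketch of the four metrics in terms of these counts follows. It assumes that the equations take the form commonly used with this notation, for example Sn = 1 − (misclassified hotspots) / (all hotspots), which is algebraically equivalent to the usual confusion-matrix definitions. The function and argument names are hypothetical, and the numbers in the example are invented for illustration; they are not results from the paper.</p>
<preformat>
```python
from math import sqrt


def evaluate(n_pos, n_neg, pos_as_neg, neg_as_pos):
    """Sn, Sp, Acc, MCC from class sizes and the two kinds of misclassification.

    n_pos, n_neg: numbers of tested hotspots and coldspots.
    pos_as_neg:   hotspots predicted as coldspots.
    neg_as_pos:   coldspots predicted as hotspots.
    """
    sn = 1 - pos_as_neg / n_pos
    sp = 1 - neg_as_pos / n_neg
    acc = 1 - (pos_as_neg + neg_as_pos) / (n_pos + n_neg)
    mcc = (1 - (pos_as_neg / n_pos + neg_as_pos / n_neg)) / sqrt(
        (1 + (neg_as_pos - pos_as_neg) / n_pos)
        * (1 + (pos_as_neg - neg_as_pos) / n_neg)
    )
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Invented error counts on a dataset of 490 hotspots and 591 coldspots.
print(evaluate(490, 591, 60, 80))
```
</preformat>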
</sec>
</sec>
<sec disp-level="1">
<title>Results</title>
<sec disp-level="2">
<title>Performance of SVM-GKM</title>
<p>The SVM-GKM predictor is constructed by only using the gapped k-mer as a feature. We first evaluate the impact of the parameter word length
<italic>k</italic>
(see method section for details) on the performance of SVM-GKM.
<xref ref-type="fig" rid="f2">Figure 2</xref>
shows the Acc (accuracy) values obtained by the SVM-GKM using the word length
<italic>k</italic>
from 8 to 15 with match length
<italic>m</italic>
set to 7. The performance of SVM-GKM increases significantly with the growth of
<italic>k</italic>
values, and SVM-GKM achieves the best performance when
<italic>k</italic>
 = 13. These results are not surprising, because for larger
<italic>k</italic>
values, more sequence order information can be incorporated into the predictor, contributing to higher performance for recombination spot identification.</p>
</sec>
<sec disp-level="2">
<title>Performance comparison between SVM-GKM and kmer-SVM</title>
<p>The k-mer is a widely used feature that captures the local sequence order information along DNA sequences. GKM improves on the k-mer by introducing gaps into k-mers. For comparison, a predictor called kmer-SVM is constructed based on k-mers. The kmer-SVM can be viewed as a special case of SVM-GKM without gaps. Therefore, the implementation of kmer-SVM is the same as that of SVM-GKM except that the gap number
<italic>n</italic>
is set to 0, and the tree structure is also employed so as to reduce the computational cost. The performance of these two methods on the benchmark dataset with different parameters is shown in
<xref ref-type="fig" rid="f2">Fig. 2</xref>
.</p>
<p>As shown in
<xref ref-type="fig" rid="f2">Fig. 2</xref>
, SVM-GKM consistently outperforms kmer-SVM, especially for larger word length values (
<italic>k</italic>
 > 9). We can also see that parameter
<italic>k</italic>
does not have significant impact on the performance of SVM-GKM, and SVM-GKM achieves its highest accuracy (86.57%) when
<italic>k</italic>
 = 13. In contrast, kmer-SVM achieves its highest accuracy (82.31%) when
<italic>k</italic>
 = 10 and then its performance decreases significantly. This is because when
<italic>k</italic>
is larger than 10, the dimension of the feature vectors is very large and most values are zero, leading to an extremely sparse feature space. For example, when
<italic>k</italic>
 = 13, the dimension of the feature vectors generated by kmer-SVM is 4
<sup>13</sup>
 ≈ 6.7 × 10
<sup>7</sup>
. In contrast, for the same word length, the length of feature vectors generated by SVM-GKM is only
<inline-formula id="d33e725">
<inline-graphic id="d33e726" xlink:href="srep23934-m12.jpg"></inline-graphic>
</inline-formula>
 ≈ 7.1 × 10
<sup>6</sup>
, which is much smaller than that of kmer-SVM, and therefore, GKM can efficiently avoid the sparse problem.
<xref ref-type="fig" rid="f3">Figure 3</xref>
presents the comparison of the four performance measures between these two predictors, from which we can see that SVM-GKM outperforms kmer-SVM in terms of all the four performance measures.</p>
</sec>
<sec disp-level="2">
<title>Comparison to Other Related Methods</title>
<p>We also compare SVM-GKM with two other highly related methods, iRSpot-PseDNC
<xref ref-type="bibr" rid="b53">53</xref>
and IDQD
<xref ref-type="bibr" rid="b17">17</xref>
. They both use the local or long range sequence order information extracted from DNA sequences for recombination spot identification, and achieve the state-of-the-art performance. The iRSpot-PseDNC exploits a novel feature vector called ‘pseudo dinucleotide composition’ based on six local DNA structural properties, including three angular parameters and three translational parameters. The IDQD method is based on sequence k-mer frequencies proposed by Liu
<italic>et al.</italic>
</p>
<p>
<xref ref-type="table" rid="t1">Table 1</xref>
shows the five-fold cross-validation results of the various predictors on the benchmark dataset, from which we can see that SVM-GKM outperforms all the other competing methods. The main reason for its better performance is that SVM-GKM can efficiently reduce the dimension of the resulting feature vectors, and avoid the risk of sparsity and overfitting. Therefore, we conclude that SVM-GKM would be a useful tool for recombination spot identification.</p>
</sec>
<sec disp-level="2">
<title>Feature Analysis</title>
<p>It is interesting to explore whether the gapped k-mers can reflect the characteristics of the recombination spots. Therefore, the discriminative power of the different gapped k-mers in SVM-GKM is calculated by using Principal Component Analysis (PCA)
<xref ref-type="bibr" rid="b54">54</xref>
<xref ref-type="bibr" rid="b55">55</xref>
<xref ref-type="bibr" rid="b56">56</xref>
, and the most discriminative gapped k-mer is ‘CCG*T**C**CA*’ (* represents a gap) according to variance ratio. Interestingly, this gapped k-mer is able to reflect the sequence characteristics of two important yeast hotspot motifs M26 and 4095
<xref ref-type="bibr" rid="b57">57</xref>
as shown in
<xref ref-type="table" rid="t2">Table 2</xref>
, indicating that the gapped k-mer feature can indeed capture the sequence patterns of the hotspots, which may explain why SVM-GKM outperforms other computational predictors.</p>
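<p>To make the notation concrete, the following sketch checks where a gapped k-mer such as the one above occurs in a sequence, treating each * as a position that matches any nucleotide. The helper names and the test sequence are hypothetical; the sequence is constructed for illustration and is not one of the yeast motifs.</p>
<preformat>
```python
def matches(pattern, word, gap="*"):
    """True if word agrees with pattern at every non-gap position."""
    return len(pattern) == len(word) and all(
        p == gap or p == c for p, c in zip(pattern, word)
    )


def find_occurrences(pattern, seq):
    """Start positions (0-based) of all windows of seq that match pattern."""
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1) if matches(pattern, seq[i:i + k])]


pattern = "CCG*T**C**CA*"          # 13 positions, 7 of them informative
seq = "TTCCGATGGCTTCAGAA"          # hypothetical sequence
print(find_occurrences(pattern, seq))
```
</preformat>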
</sec>
</sec>
<sec disp-level="1">
<title>Discussion</title>
<p>As a widely used feature in the field of recombination spot identification, the k-mer only incorporates the local sequence composition information of DNA sequences. To overcome this disadvantage, the gapped k-mer (GKM) has been proposed to incorporate long range sequence order information and reduce the length of the feature vectors. GKM overcomes the sparsity problem caused by k-mers by introducing gaps into the k-mers, and has been successfully applied to enhancer identification. In this study, we apply the concept of GKM to the field of recombination spot identification, and demonstrate that this approach can clearly improve the predictive performance. These results are not surprising, because previous studies
<xref ref-type="bibr" rid="b48">48</xref>
<xref ref-type="bibr" rid="b58">58</xref>
<xref ref-type="bibr" rid="b59">59</xref>
<xref ref-type="bibr" rid="b60">60</xref>
show that the long range or global sequence order effects are critical for constructing accurate predictors. Therefore, it is important to explore new features that can capture the characteristics of these motifs. However, it is by no means an easy task due to the extremely sparse feature vector problem. The gapped k-mer overcomes this problem and incorporates long range sequence order information, and therefore, the proposed predictor SVM-GKM based on gapped k-mers outperforms other state-of-the-art predictors. Analysis of the most discriminative feature in SVM-GKM shows that the gapped k-mers indeed reflect the characteristics of some motifs of recombination spots.</p>
<p>Besides k-mer and gapped k-mer, palindrome structure, relatively high GC content, dinucleotide bias, and consensus DNA motifs have been shown to be useful for recombination spot identification. Our future study will focus on exploring various feature combinations to construct a computational predictor. Performance improvement can be expected by using some neural-like computing strategies, such as spiking neural models
<xref ref-type="bibr" rid="b6">6</xref>
<xref ref-type="bibr" rid="b11">11</xref>
<xref ref-type="bibr" rid="b61">61</xref>
<xref ref-type="bibr" rid="b62">62</xref>
<xref ref-type="bibr" rid="b63">63</xref>
<xref ref-type="bibr" rid="b64">64</xref>
, because these features are able to capture the characteristics of recombination spots in different aspects.</p>
</sec>
<sec disp-level="1">
<title>Additional Information</title>
<p>
<bold>How to cite this article</bold>
: Wang, R.
<italic>et al.</italic>
Recombination spot identification Based on gapped k-mers.
<italic>Sci. Rep.</italic>
<bold>6</bold>
, 23934; doi: 10.1038/srep23934 (2016).</p>
</sec>
<sec sec-type="supplementary-material" id="S1">
<title>Supplementary Material</title>
<supplementary-material id="d33e35" content-type="local-data">
<caption>
<title>Supplementary Information</title>
</caption>
<media xlink:href="srep23934-s1.pdf"></media>
</supplementary-material>
</sec>
</body>
<back>
<ack>
<p>This work was supported by the National Natural Science Foundation of China (No. 61300112), the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry, the Natural Science Foundation of Guangdong Province (2014A030313695), Shenzhen Municipal Science and Technology Innovation Council (Grant No. CXZZ20140904154910774), and Scientific Research Foundation in Shenzhen (Grant No. CYJ20150626110425228).</p>
</ack>
<ref-list>
<ref id="b1">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Feng</surname>
<given-names>P.</given-names>
</name>
,
<name>
<surname>Lin</surname>
<given-names>H.</given-names>
</name>
&
<name>
<surname>Chou</surname>
<given-names>K.</given-names>
</name>
<article-title>iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition</article-title>
.
<source>Nucleic Acids Res</source>
<volume>41</volume>
,
<fpage>e68</fpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23303794</pub-id>
</mixed-citation>
</ref>
<ref id="b2">
<mixed-citation publication-type="journal">
<name>
<surname>Arnheim</surname>
<given-names>N.</given-names>
</name>
,
<name>
<surname>Calabrese</surname>
<given-names>P.</given-names>
</name>
&
<name>
<surname>Tiemann-Boege</surname>
<given-names>I.</given-names>
</name>
<article-title>Mammalian meiotic recombination hot spots</article-title>
.
<source>Annu Rev Genet.</source>
<volume>41</volume>
,
<fpage>369</fpage>
<lpage>399</lpage>
(
<year>2007</year>
).
<pub-id pub-id-type="pmid">18076329</pub-id>
</mixed-citation>
</ref>
<ref id="b3">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Tian</surname>
<given-names>Y.</given-names>
</name>
,
<name>
<surname>Cheng</surname>
<given-names>R.</given-names>
</name>
&
<name>
<surname>Jin</surname>
<given-names>Y.</given-names>
</name>
<article-title>An efficient approach to non-dominated sorting for evolutionary multi-objective optimization</article-title>
.
<source>IEEE T Evolut Comput</source>
<volume>19</volume>
,
<fpage>201</fpage>
<lpage>213</lpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b4">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Tian</surname>
<given-names>Y.</given-names>
</name>
&
<name>
<surname>Jin</surname>
<given-names>Y.</given-names>
</name>
<article-title>A knee point driven evolutionary algorithm for many-objective optimization</article-title>
.
<source>IEEE T Evolut Comput</source>
<volume>19</volume>
,
<fpage>761</fpage>
<lpage>776</lpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b5">
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<italic>et al.</italic>
<article-title>Sequence-based identification of recombination spots using pseudo nucleic acid representation and recursive feature extraction by linear kernel SVM</article-title>
.
<source>BMC Bioinformatics</source>
<volume>15</volume>
,
<fpage>340</fpage>
<lpage>340</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">25409550</pub-id>
</mixed-citation>
</ref>
<ref id="b6">
<mixed-citation publication-type="journal">
<name>
<surname>Wei</surname>
<given-names>L.</given-names>
</name>
<italic>et al.</italic>
<article-title>Improved and Promising Identification of Human MicroRNAs by Incorporating a High-quality Negative Set</article-title>
.
<source>IEEE/ACM Trans Comput Biol Bioinform</source>
<volume>11</volume>
,
<fpage>192</fpage>
<lpage>201</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">26355518</pub-id>
</mixed-citation>
</ref>
<ref id="b7">
<mixed-citation publication-type="journal">
<name>
<surname>Weyn</surname>
<given-names>B.</given-names>
</name>
<italic>et al.</italic>
<article-title>Determination of tumour prognosis based on angiogenesis-related vascular patterns measured by fractal and syntactic structure analysis</article-title>
.
<source>Clinical Oncology</source>
<volume>16</volume>
,
<fpage>307</fpage>
<lpage>316</lpage>
(
<year>2004</year>
).
<pub-id pub-id-type="pmid">15214656</pub-id>
</mixed-citation>
</ref>
<ref id="b8">
<mixed-citation publication-type="journal">
<name>
<surname>Zou</surname>
<given-names>Q.</given-names>
</name>
,
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Huang</surname>
<given-names>Y.</given-names>
</name>
,
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
<article-title>Identifying Multi-functional Enzyme with Hierarchical Multi-label Classifier</article-title>
.
<source>J Comput Theor Nanos</source>
<volume>10</volume>
,
<fpage>1038</fpage>
<lpage>1043</lpage>
(
<year>2013</year>
).</mixed-citation>
</ref>
<ref id="b9">
<mixed-citation publication-type="journal">
<name>
<surname>Peng</surname>
<given-names>J.</given-names>
</name>
<italic>et al.</italic>
<article-title>DYMHC: detecting the yeast meiotic recombination hotspots and coldspots by random forest model using gapped dinucleotide composition features</article-title>
.
<source>Nucleic Acids Res</source>
<volume>35</volume>
,
<fpage>W47</fpage>
<lpage>W51</lpage>
(
<year>2008</year>
).</mixed-citation>
</ref>
<ref id="b10">
<mixed-citation publication-type="journal">
<name>
<surname>Cheng</surname>
<given-names>X.-Y.</given-names>
</name>
<italic>et al.</italic>
<article-title>A Global Characterization and Identification of Multifunctional Enzymes</article-title>
.
<source>PLoS One</source>
<volume>7</volume>
,
<fpage>e38979</fpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22723914</pub-id>
</mixed-citation>
</ref>
<ref id="b11">
<mixed-citation publication-type="journal">
<name>
<surname>Zeng</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Xu</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Pan</surname>
<given-names>L.</given-names>
</name>
<article-title>On languages generated by spiking neural P systems with weights</article-title>
.
<source>Information Sciences</source>
<volume>278</volume>
,
<fpage>423</fpage>
<lpage>433</lpage>
(
<year>2014</year>
).</mixed-citation>
</ref>
<ref id="b12">
<mixed-citation publication-type="journal">
<name>
<surname>Lin</surname>
<given-names>C.</given-names>
</name>
<italic>et al.</italic>
<article-title>Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier</article-title>
.
<source>PLoS One</source>
<volume>8</volume>
,
<fpage>e56499</fpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23437146</pub-id>
</mixed-citation>
</ref>
<ref id="b13">
<mixed-citation publication-type="journal">
<name>
<surname>Zou</surname>
<given-names>Q.</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Jiang</surname>
<given-names>Y.</given-names>
</name>
,
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
&
<name>
<surname>Wang</surname>
<given-names>G.</given-names>
</name>
<article-title>BinMemPredict: a Web server and software for predicting membrane protein types</article-title>
.
<source>Curr Proteomics</source>
<volume>10</volume>
,
<fpage>2</fpage>
<lpage>9</lpage>
(
<year>2013</year>
).</mixed-citation>
</ref>
<ref id="b14">
<mixed-citation publication-type="journal">
<name>
<surname>Zou</surname>
<given-names>Q.</given-names>
</name>
<italic>et al.</italic>
<article-title>Improving tRNAscan-SE annotation results via ensemble classifiers</article-title>
.
<source>Mol Inform</source>
<volume>34</volume>
,
<fpage>761</fpage>
<lpage>770</lpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b15">
<mixed-citation publication-type="journal">
<name>
<surname>Zou</surname>
<given-names>Q.</given-names>
</name>
,
<name>
<surname>Zeng</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Cao</surname>
<given-names>L.</given-names>
</name>
&
<name>
<surname>Ji</surname>
<given-names>R.</given-names>
</name>
<article-title>A Novel Features Ranking Metric with Application to Scalable Visual and Bioinformatics Data Classification</article-title>
.
<source>Neurocomputing</source>
<volume>173</volume>
,
<fpage>346</fpage>
<lpage>354</lpage>
(
<year>2016</year>
).</mixed-citation>
</ref>
<ref id="b16">
<mixed-citation publication-type="journal">
<name>
<surname>Gerton</surname>
<given-names>J. L.</given-names>
</name>
<italic>et al.</italic>
<article-title>Global Mapping of Meiotic Recombination Hotspots and Coldspots in the Yeast Saccharomyces cerevisiae</article-title>
.
<source>P Natl Acad Sci USA</source>
<volume>97</volume>
,
<fpage>11383</fpage>
<lpage>11390</lpage>
(
<year>2000</year>
).</mixed-citation>
</ref>
<ref id="b17">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>G.</given-names>
</name>
,
<name>
<surname>Jia</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Cui</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Lu</surname>
<given-names>C.</given-names>
</name>
<article-title>Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae</article-title>
.
<source>J Theor Biol</source>
<volume>293</volume>
,
<fpage>49</fpage>
<lpage>54</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22016025</pub-id>
</mixed-citation>
</ref>
<ref id="b18">
<mixed-citation publication-type="journal">
<name>
<surname>Nanni</surname>
<given-names>L.</given-names>
</name>
&
<name>
<surname>Lumini</surname>
<given-names>A.</given-names>
</name>
<article-title>Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization</article-title>
.
<source>Amino Acids</source>
<volume>34</volume>
,
<fpage>653</fpage>
<lpage>660</lpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">18175047</pub-id>
</mixed-citation>
</ref>
<ref id="b19">
<mixed-citation publication-type="journal">
<name>
<surname>Sahu</surname>
<given-names>S. S.</given-names>
</name>
&
<name>
<surname>Panda</surname>
<given-names>G.</given-names>
</name>
<article-title>Brief Communication: A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction</article-title>
.
<source>Comput Biol Chem</source>
<volume>34</volume>
,
<fpage>320</fpage>
<lpage>327</lpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">21106461</pub-id>
</mixed-citation>
</ref>
<ref id="b20">
<mixed-citation publication-type="journal">
<name>
<surname>Nanni</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Lumini</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Gupta</surname>
<given-names>D.</given-names>
</name>
&
<name>
<surname>Garg</surname>
<given-names>A.</given-names>
</name>
<article-title>Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou’s pseudo amino acid composition and on evolutionary information</article-title>
.
<source>IEEE/ACM Trans Comput Biol Bioinform</source>
<volume>9</volume>
,
<fpage>467</fpage>
<lpage>475</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">21860064</pub-id>
</mixed-citation>
</ref>
<ref id="b21">
<mixed-citation publication-type="journal">
<name>
<surname>Chou</surname>
<given-names>K.-C.</given-names>
</name>
<article-title>Prediction of protein cellular attributes using pseudo-amino acid composition</article-title>
.
<source>Proteins</source>
<volume>43</volume>
,
<fpage>246</fpage>
<lpage>255</lpage>
(
<year>2001</year>
).
<pub-id pub-id-type="pmid">11288174</pub-id>
</mixed-citation>
</ref>
<ref id="b22">
<mixed-citation publication-type="journal">
<name>
<surname>Getun</surname>
<given-names>I. V.</given-names>
</name>
,
<name>
<surname>Wu</surname>
<given-names>Z. K.</given-names>
</name>
,
<name>
<surname>Khalil</surname>
<given-names>A. M.</given-names>
</name>
&
<name>
<surname>Bois</surname>
<given-names>P. R. J.</given-names>
</name>
<article-title>Nucleosome occupancy landscape and dynamics at mouse recombination hotspots</article-title>
.
<source>EMBO Rep</source>
<volume>11</volume>
,
<fpage>555</fpage>
<lpage>560</lpage>
(
<year>2010</year>
).
<pub-id pub-id-type="pmid">20508641</pub-id>
</mixed-citation>
</ref>
<ref id="b23">
<mixed-citation publication-type="journal">
<name>
<surname>Nasar</surname>
<given-names>F.</given-names>
</name>
,
<name>
<surname>Jankowski</surname>
<given-names>C.</given-names>
</name>
&
<name>
<surname>Nag</surname>
<given-names>D. K.</given-names>
</name>
<article-title>Long palindromic sequences induce double-strand breaks during meiosis in yeast</article-title>
.
<source>Mol Cell Biol</source>
<volume>20</volume>
,
<fpage>3449</fpage>
<lpage>3458</lpage>
(
<year>2000</year>
).
<pub-id pub-id-type="pmid">10779335</pub-id>
</mixed-citation>
</ref>
<ref id="b24">
<mixed-citation publication-type="journal">
<name>
<surname>Wei</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Liao</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Gao</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Zou</surname>
<given-names>Q.</given-names>
</name>
<article-title>An Improved Protein Structural Prediction Method by Incorporating Both Sequence and Structure Information</article-title>
.
<source>IEEE T Nanobiosci</source>
<volume>14</volume>
,
<fpage>339</fpage>
<lpage>349</lpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b25">
<mixed-citation publication-type="journal">
<name>
<surname>Meunier</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Duret</surname>
<given-names>L.</given-names>
</name>
<article-title>Recombination drives the evolution of GC-content in the human genome</article-title>
.
<source>Mol Biol Evol</source>
<volume>21</volume>
,
<fpage>984</fpage>
<lpage>990</lpage>
(
<year>2004</year>
).
<pub-id pub-id-type="pmid">14963104</pub-id>
</mixed-citation>
</ref>
<ref id="b26">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>G.</given-names>
</name>
&
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<article-title>The correlation between recombination rate and dinucleotide bias in Drosophila melanogaster</article-title>
.
<source>J Mol Evol</source>
<volume>67</volume>
,
<fpage>358</fpage>
<lpage>367</lpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">18797953</pub-id>
</mixed-citation>
</ref>
<ref id="b27">
<mixed-citation publication-type="journal">
<name>
<surname>Myers</surname>
<given-names>S.</given-names>
</name>
,
<name>
<surname>Freeman</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Auton</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Donnelly</surname>
<given-names>P.</given-names>
</name>
&
<name>
<surname>McVean</surname>
<given-names>G.</given-names>
</name>
<article-title>A common sequence motif associated with recombination hot spots and genome instability in humans</article-title>
.
<source>Nat Genet</source>
<volume>40</volume>
,
<fpage>1124</fpage>
<lpage>1129</lpage>
(
<year>2008</year>
).
<pub-id pub-id-type="pmid">19165926</pub-id>
</mixed-citation>
</ref>
<ref id="b28">
<mixed-citation publication-type="journal">
<name>
<surname>Fletez-Brant</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Lee</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>McCallion</surname>
<given-names>A. S.</given-names>
</name>
&
<name>
<surname>Beer</surname>
<given-names>M. A.</given-names>
</name>
<article-title>kmer-SVM: a web server for identifying predictive regulatory sequence features in genomic data sets</article-title>
.
<source>Nucleic Acids Res</source>
<volume>41</volume>
,
<fpage>W544</fpage>
<lpage>W556</lpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23771147</pub-id>
</mixed-citation>
</ref>
<ref id="b29">
<mixed-citation publication-type="journal">
<name>
<surname>Ghandi</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Mohammad-Noori</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Beer</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Robust k-mer frequency estimation using gapped k-mers</article-title>
.
<source>J Math Biol</source>
<volume>69</volume>
,
<fpage>469</fpage>
<lpage>500</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">23861010</pub-id>
</mixed-citation>
</ref>
<ref id="b30">
<mixed-citation publication-type="journal">
<name>
<surname>Lee</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Karchin</surname>
<given-names>R.</given-names>
</name>
&
<name>
<surname>Beer</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Discriminative prediction of mammalian enhancers from DNA sequence</article-title>
.
<source>Genome Research</source>
<volume>21</volume>
<bold>(12)</bold>
,
<fpage>2167</fpage>
<lpage>2180</lpage>
(
<year>2011</year>
).
<pub-id pub-id-type="pmid">21875935</pub-id>
</mixed-citation>
</ref>
<ref id="b31">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<italic>et al.</italic>
<article-title>Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences</article-title>
.
<source>Nucleic Acids Res</source>
<volume>43</volume>
<bold>(W1)</bold>
,
<fpage>W65</fpage>
<lpage>W71</lpage>
(
<year>2015</year>
).
<pub-id pub-id-type="pmid">25958395</pub-id>
</mixed-citation>
</ref>
<ref id="b32">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<italic>et al.</italic>
<article-title>PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou’s PseAAC and Physicochemical Distance Transformation</article-title>
.
<source>Mol Inform</source>
<volume>34</volume>
,
<fpage>8</fpage>
<lpage>17</lpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b33">
<mixed-citation publication-type="journal">
<name>
<surname>Ghandi</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Lee</surname>
<given-names>D.</given-names>
</name>
,
<name>
<surname>Mohammad-Noori</surname>
<given-names>M.</given-names>
</name>
&
<name>
<surname>Beer</surname>
<given-names>M. A.</given-names>
</name>
<article-title>Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features</article-title>
.
<source>PLoS Comput Biol</source>
<volume>10</volume>
<bold>(7)</bold>
, (
<year>2014</year>
).</mixed-citation>
</ref>
<ref id="b34">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Fang</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Jie</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Liu</surname>
<given-names>F.</given-names>
</name>
&
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<article-title>miRNA-dis: microRNA precursor identification based on distance structure status pairs</article-title>
.
<source>Mol Biosyst</source>
<volume>11</volume>
,
<fpage>1194</fpage>
<lpage>1204</lpage>
(
<year>2015</year>
).
<pub-id pub-id-type="pmid">25715848</pub-id>
</mixed-citation>
</ref>
<ref id="b35">
<mixed-citation publication-type="journal">
<name>
<surname>Quek</surname>
<given-names>L. E.</given-names>
</name>
&
<name>
<surname>Nielsen</surname>
<given-names>L. K.</given-names>
</name>
<article-title>A depth-first search algorithm to compute elementary flux modes by linear programming</article-title>
.
<source>BMC Syst Biol</source>
<volume>8</volume>
,
<fpage>1</fpage>
<lpage>10</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24393148</pub-id>
</mixed-citation>
</ref>
<ref id="b36">
<mixed-citation publication-type="journal">
<name>
<surname>Zhu</surname>
<given-names>T.</given-names>
</name>
<italic>et al.</italic>
<article-title>A metabolic network analysis & NMR experiment design tool with user interface-driven model construction for depth-first search analysis</article-title>
.
<source>Metab Eng</source>
<volume>5</volume>
,
<fpage>74</fpage>
<lpage>85</lpage>
(
<year>2003</year>
).</mixed-citation>
</ref>
<ref id="b37">
<mixed-citation publication-type="journal">
<name>
<surname>Leslie</surname>
<given-names>C. S.</given-names>
</name>
,
<name>
<surname>Eskin</surname>
<given-names>E.</given-names>
</name>
,
<name>
<surname>Cohen</surname>
<given-names>A.</given-names>
</name>
,
<name>
<surname>Weston</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Noble</surname>
<given-names>W. S.</given-names>
</name>
<article-title>Mismatch string kernels for discriminative protein classification</article-title>
.
<source>Bioinformatics</source>
<volume>20</volume>
,
<fpage>467</fpage>
<lpage>476</lpage>
(
<year>2004</year>
).
<pub-id pub-id-type="pmid">14990442</pub-id>
</mixed-citation>
</ref>
<ref id="b38">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<italic>et al.</italic>
<article-title>Identification of real microRNA precursors with a pseudo structure status composition approach</article-title>
.
<source>PLoS One</source>
<volume>10</volume>
,
<fpage>e0121501</fpage>
(
<year>2015</year>
).
<pub-id pub-id-type="pmid">25821974</pub-id>
</mixed-citation>
</ref>
<ref id="b39">
<mixed-citation publication-type="journal">
<name>
<surname>Zeng</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Zou</surname>
<given-names>Q.</given-names>
</name>
<article-title>Integrative approaches for predicting microRNA function and prioritizing disease-related microRNA using biological interaction networks</article-title>
.
<source>Briefings in Bioinformatics</source>
<fpage>bbv033</fpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b40">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Feng</surname>
<given-names>P.</given-names>
</name>
&
<name>
<surname>Lin</surname>
<given-names>H.</given-names>
</name>
<article-title>Prediction of replication origins by calculating DNA structural properties</article-title>
.
<source>FEBS Letters</source>
<volume>23</volume>
,
<fpage>934</fpage>
<lpage>938</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22449982</pub-id>
</mixed-citation>
</ref>
<ref id="b41">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
<italic>et al.</italic>
<article-title>iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties</article-title>
.
<source>PLoS One</source>
<volume>7</volume>
,
<fpage>e47843</fpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">23144709</pub-id>
</mixed-citation>
</ref>
<ref id="b42">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Feng</surname>
<given-names>P.-M.</given-names>
</name>
,
<name>
<surname>Lin</surname>
<given-names>H.</given-names>
</name>
&
<name>
<surname>Chou</surname>
<given-names>K.-C.</given-names>
</name>
<article-title>iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition</article-title>
.
<source>Biomed Res Int</source>
<volume>2014</volume>
,
<fpage>623149</fpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24967386</pub-id>
</mixed-citation>
</ref>
<ref id="b43">
<mixed-citation publication-type="journal">
<name>
<surname>Manoj</surname>
<given-names>B.</given-names>
</name>
&
<name>
<surname>Raghava</surname>
<given-names>G. P. S.</given-names>
</name>
<article-title>ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST</article-title>
.
<source>Nucleic Acids Res</source>
<volume>32</volume>
,
<fpage>W414</fpage>
<lpage>W419</lpage>
(
<year>2004</year>
).
<pub-id pub-id-type="pmid">15215421</pub-id>
</mixed-citation>
</ref>
<ref id="b44">
<mixed-citation publication-type="journal">
<name>
<surname>Hua</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Sun</surname>
<given-names>Z.</given-names>
</name>
<article-title>Support vector machine approach for protein subcellular localization prediction</article-title>
.
<source>Bioinformatics</source>
<volume>17</volume>
,
<fpage>721</fpage>
<lpage>728</lpage>
(
<year>2001</year>
).
<pub-id pub-id-type="pmid">11524373</pub-id>
</mixed-citation>
</ref>
<ref id="b45">
<mixed-citation publication-type="journal">
<name>
<surname>Bhasin</surname>
<given-names>M.</given-names>
</name>
,
<name>
<surname>Reinherz</surname>
<given-names>E. L.</given-names>
</name>
&
<name>
<surname>Reche</surname>
<given-names>P. A.</given-names>
</name>
<article-title>Recognition and classification of histones using support vector machine</article-title>
.
<source>J Comput Biol</source>
<volume>13</volume>
,
<fpage>102</fpage>
<lpage>112</lpage>
(
<year>2006</year>
).</mixed-citation>
</ref>
<ref id="b46">
<mixed-citation publication-type="journal">
<name>
<surname>Leslie</surname>
<given-names>C.</given-names>
</name>
,
<name>
<surname>Eskin</surname>
<given-names>E.</given-names>
</name>
&
<name>
<surname>Noble</surname>
<given-names>W. S.</given-names>
</name>
<article-title>The spectrum kernel: a string kernel for SVM protein classification</article-title>
.
<source>Pac Symp Biocomput</source>
,
<fpage>564</fpage>
<lpage>575</lpage>
(
<year>2002</year>
).
<pub-id pub-id-type="pmid">11928508</pub-id>
</mixed-citation>
</ref>
<ref id="b47">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<article-title>Application of Learning to Rank to protein remote homology detection</article-title>
.
<source>Bioinformatics</source>
,
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv413</pub-id>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b48">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Fang</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Long</surname>
<given-names>R.</given-names>
</name>
,
<name>
<surname>Lan</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Chou</surname>
<given-names>K.-C.</given-names>
</name>
<article-title>iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition</article-title>
.
<source>Bioinformatics</source>
,
<pub-id pub-id-type="doi">10.1093/bioinformatics/btv604</pub-id>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b49">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<italic>et al.</italic>
<article-title>iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition</article-title>
.
<source>PLoS One</source>
<volume>9</volume>
,
<fpage>e106691</fpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">25184541</pub-id>
</mixed-citation>
</ref>
<ref id="b50">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
,
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<article-title>iMiRNA-SSF: Improving the Identification of MicroRNA Precursors by Combining Negative Sets with Different Distributions</article-title>
.
<source>Sci Rep-UK</source>
<volume>6</volume>
,
<fpage>19062</fpage>
(
<year>2016</year>
).</mixed-citation>
</ref>
<ref id="b51">
<mixed-citation publication-type="journal">
<name>
<surname>Yang</surname>
<given-names>S.</given-names>
</name>
<italic>et al.</italic>
<article-title>Representation of fluctuation features in pathological knee joint vibroarthrographic signals using kernel density modeling method</article-title>
.
<source>Medical Engineering and Physics</source>
<volume>36</volume>
,
<fpage>1305</fpage>
<lpage>1311</lpage>
,
<pub-id pub-id-type="doi">10.1016/j.medengphy.2014.07.008</pub-id>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">25096412</pub-id>
</mixed-citation>
</ref>
<ref id="b52">
<mixed-citation publication-type="journal">
<name>
<surname>Yang</surname>
<given-names>S.</given-names>
</name>
<italic>et al.</italic>
<article-title>Effective dysphonia detection using feature dimension reduction and kernel density estimation for patients with Parkinson’s disease</article-title>
.
<source>PLoS One</source>
<volume>9</volume>
,
<fpage>e88825</fpage>
,
<pub-id pub-id-type="doi">10.1371/journal.pone.0088825</pub-id>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24586406</pub-id>
</mixed-citation>
</ref>
<ref id="b53">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
,
<name>
<surname>Feng</surname>
<given-names>P.-M.</given-names>
</name>
,
<name>
<surname>Lin</surname>
<given-names>H.</given-names>
</name>
&
<name>
<surname>Chou</surname>
<given-names>K.-C.</given-names>
</name>
<article-title>iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition</article-title>
.
<source>Nucleic Acids Res</source>
<volume>41</volume>
,
<fpage>e68</fpage>
(
<year>2013</year>
).
<pub-id pub-id-type="pmid">23303794</pub-id>
</mixed-citation>
</ref>
<ref id="b54">
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>S.</given-names>
</name>
&
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<article-title>Subpattern-based principle component analysis</article-title>
.
<source>Pattern Recogn</source>
<volume>37</volume>
,
<fpage>1081</fpage>
<lpage>1083</lpage>
(
<year>2004</year>
).</mixed-citation>
</ref>
<ref id="b55">
<mixed-citation publication-type="journal">
<name>
<surname>Smith</surname>
<given-names>L. I.</given-names>
</name>
<article-title>A Tutorial on Principal Components Analysis</article-title>
.
<source>Eprint Arxiv</source>
<volume>58</volume>
,
<fpage>219</fpage>
<lpage>226</lpage>
(
<year>2002</year>
).</mixed-citation>
</ref>
<ref id="b56">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Chen</surname>
<given-names>J.</given-names>
</name>
&
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<article-title>Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis</article-title>
.
<source>Mol Genet Genomics</source>
<volume>290</volume>
,
<fpage>1919</fpage>
<lpage>1931</lpage>
(
<year>2015</year>
).
<pub-id pub-id-type="pmid">25896721</pub-id>
</mixed-citation>
</ref>
<ref id="b57">
<mixed-citation publication-type="journal">
<name>
<surname>Steiner</surname>
<given-names>W. W.</given-names>
</name>
&
<name>
<surname>Steiner</surname>
<given-names>E. M.</given-names>
</name>
<article-title>Fission Yeast Hotspot Sequence Motifs Are Also Active in Budding Yeast</article-title>
.
<source>PLoS One</source>
<volume>7</volume>
,
<fpage>83</fpage>
<lpage>83</lpage>
(
<year>2012</year>
).</mixed-citation>
</ref>
<ref id="b58">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<italic>et al.</italic>
<article-title>Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy</article-title>
.
<source>J Theor Biol</source>
<volume>385</volume>
,
<fpage>153</fpage>
<lpage>159</lpage>
(
<year>2015</year>
).
<pub-id pub-id-type="pmid">26362104</pub-id>
</mixed-citation>
</ref>
<ref id="b59">
<mixed-citation publication-type="journal">
<name>
<surname>Getun</surname>
<given-names>I. V.</given-names>
</name>
,
<name>
<surname>Wu</surname>
<given-names>Z. K.</given-names>
</name>
&
<name>
<surname>Bois</surname>
<given-names>P. R. J.</given-names>
</name>
<article-title>Organization and roles of nucleosomes at mouse meiotic recombination hotspots</article-title>
.
<source>Nucleus</source>
<volume>3</volume>
,
<fpage>244</fpage>
<lpage>250</lpage>
(
<year>2012</year>
).
<pub-id pub-id-type="pmid">22572955</pub-id>
</mixed-citation>
</ref>
<ref id="b60">
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
,
<name>
<surname>Fang</surname>
<given-names>L.</given-names>
</name>
,
<name>
<surname>Liu</surname>
<given-names>F.</given-names>
</name>
,
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
&
<name>
<surname>Chou</surname>
<given-names>K.-C.</given-names>
</name>
<article-title>iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach</article-title>
.
<source>J Biomol Struct Dyn</source>
<volume>34</volume>
,
<fpage>220</fpage>
<lpage>232</lpage>
(
<year>2016</year>
).</mixed-citation>
</ref>
<ref id="b61">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Pan</surname>
<given-names>L.</given-names>
</name>
&
<name>
<surname>Păun</surname>
<given-names>A.</given-names>
</name>
<article-title>On universality of axon P systems</article-title>
.
<source>IEEE T Neur Net Lear</source>
<volume>26</volume>
,
<fpage>2816</fpage>
<lpage>2829</lpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b62">
<mixed-citation publication-type="journal">
<name>
<surname>Song</surname>
<given-names>T.</given-names>
</name>
&
<name>
<surname>Pan</surname>
<given-names>L.</given-names>
</name>
<article-title>On the Universality and Non-universality of Spiking Neural P Systems with Rules on Synapses</article-title>
.
<source>IEEE Trans on Nanobioscience</source>
,
<pub-id pub-id-type="doi">10.1109/TNB.2015.2503603</pub-id>
(
<year>2015</year>
).</mixed-citation>
</ref>
<ref id="b63">
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Zeng</surname>
<given-names>X.</given-names>
</name>
,
<name>
<surname>Luo</surname>
<given-names>B.</given-names>
</name>
&
<name>
<surname>Pan</surname>
<given-names>L.</given-names>
</name>
<article-title>On some classes of sequential spiking neural P systems</article-title>
.
<source>Neural Comput</source>
<volume>26</volume>
,
<fpage>974</fpage>
<lpage>997</lpage>
(
<year>2014</year>
).
<pub-id pub-id-type="pmid">24555456</pub-id>
</mixed-citation>
</ref>
<ref id="b64">
<mixed-citation publication-type="journal">
<name>
<surname>Song</surname>
<given-names>T.</given-names>
</name>
&
<name>
<surname>Pan</surname>
<given-names>L.</given-names>
</name>
<article-title>Spiking Neural P Systems with Rules on Synapses Working in Maximum Spikes Consumption Strategy</article-title>
.
<source>IEEE Trans on Nanobioscience</source>
<volume>14</volume>
,
<fpage>37</fpage>
<lpage>43</lpage>
(
<year>2015</year>
).</mixed-citation>
</ref>
</ref-list>
<fn-group>
<fn>
<p>
<bold>Author Contributions</bold>
B.L. and Y.X. conceived the study, designed the experiments, and participated in drafting the manuscript and performing the statistical analysis. R.W. participated in coding the experiments and drafting the manuscript. B.L. and R.X. participated in performing the statistical analysis. All authors read and approved the final manuscript.</p>
</fn>
</fn-group>
</back>
<floats-group>
<fig id="f1">
<label>Figure 1</label>
<caption>
<title>An example to show the tree structure of k-mer counting.</title>
<p>This example uses an alphabet of only two letters, A and T. We use
<italic>k</italic>
 = 3 and three sequences
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
 = AAAAT,
<italic>S</italic>
<sub>
<italic>2</italic>
</sub>
 = ATTTT, and
<italic>S</italic>
<sub>
<italic>3</italic>
</sub>
 = AATA to build the k-mer tree. Each node
<italic>t</italic>
<sub>
<italic>i</italic>
</sub>
at depth
<italic>d</italic>
represents a sequence of length
<italic>d</italic>
, denoted by
<italic>s(t</italic>
<sub>
<italic>i</italic>
</sub>
), which is determined by the path from the root of the tree to t
<sub>i</sub>
. At depth
<italic>d</italic>
 = 3, for node t
<sub>6</sub>
,
<italic>s(t</italic>
<sub>
<italic>6</italic>
</sub>
) = ‘AAA’,
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
contains two counts of this k-mer,
<italic>S</italic>
<sub>
<italic>2</italic>
</sub>
and
<italic>S</italic>
<sub>
<italic>3</italic>
</sub>
do not contain this k-mer. For node
<italic>t</italic>
<sub>
<italic>7</italic>
</sub>
,
<italic>s(t</italic>
<sub>
<italic>7</italic>
</sub>
) = ‘AAT’,
<italic>S</italic>
<sub>
<italic>1</italic>
</sub>
and
<italic>S</italic>
<sub>
<italic>3</italic>
</sub>
both contain one count, and
<italic>S</italic>
<sub>
<italic>2</italic>
</sub>
does not contain this k-mer. Comparing
<italic>t</italic>
<sub>
<italic>6</italic>
</sub>
with
<italic>t</italic>
<sub>
<italic>7</italic>
</sub>
, the paths to these two nodes differ by only one mismatch.</p>
</caption>
<graphic xlink:href="srep23934-f1"></graphic>
</fig>
<fig id="f2">
<label>Figure 2</label>
<caption>
<title>The influence of parameter
<italic>k</italic>
on the performance of two predictors.</title>
<p>The two predictors compared are SVM-GKM and kmer-SVM. We consider the word length
<italic>k</italic>
from 8 to 15, and choose the mismatch length
<italic>m</italic>
 = 7 for the SVM-GKM predictor. SVM-GKM achieves its best result when
<italic>k</italic>
 = 13, whereas kmer-SVM achieves its best result when
<italic>k</italic>
 = 10.</p>
</caption>
<graphic xlink:href="srep23934-f2"></graphic>
</fig>
<fig id="f3">
<label>Figure 3</label>
<caption>
<title>Comparison of SVM-GKM and kmer-SVM with four performance measures.</title>
<p>This figure shows the best results that SVM-GKM and kmer-SVM achieved, where word length
<italic>k</italic>
 = 13 and matches length
<italic>m</italic>
 = 7 for SVM-GKM, and word length
<italic>k</italic>
 = 10 for kmer-SVM. SVM-GKM outperforms kmer-SVM in terms of all the four performance measures.</p>
</caption>
<graphic xlink:href="srep23934-f3"></graphic>
</fig>
<table-wrap position="float" id="t1">
<label>Table 1</label>
<caption>
<title>Results of different methods for recombination spot identification.</title>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center" char="."></col>
<col align="center" char="."></col>
<col align="center" char="."></col>
<col align="center" char="."></col>
</colgroup>
<thead valign="bottom">
<tr>
<th align="left" valign="top" charoff="50">Predictor</th>
<th align="center" valign="top" char="." charoff="50">Sn(%)</th>
<th align="center" valign="top" char="." charoff="50">Sp(%)</th>
<th align="center" valign="top" char="." charoff="50">Acc(%)</th>
<th align="center" valign="top" char="." charoff="50">MCC</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="top" charoff="50">SVM-GKM
<xref ref-type="fn" rid="t1-fn1">a</xref>
</td>
<td align="center" valign="top" char="." charoff="50">81.22</td>
<td align="center" valign="top" char="." charoff="50">90.69</td>
<td align="center" valign="top" char="." charoff="50">86.57</td>
<td align="center" valign="top" char="." charoff="50">0.728</td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">iRSpot-PseDNC
<xref ref-type="fn" rid="t1-fn2">b</xref>
</td>
<td align="center" valign="top" char="." charoff="50">81.63</td>
<td align="center" valign="top" char="." charoff="50">88.14</td>
<td align="center" valign="top" char="." charoff="50">85.19</td>
<td align="center" valign="top" char="." charoff="50">0.692</td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">IDQD
<xref ref-type="fn" rid="t1-fn3">c</xref>
</td>
<td align="center" valign="top" char="." charoff="50">79.40</td>
<td align="center" valign="top" char="." charoff="50">81.00</td>
<td align="center" valign="top" char="." charoff="50">80.30</td>
<td align="center" valign="top" char="." charoff="50">0.603</td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">kmer-SVM
<xref ref-type="fn" rid="t1-fn4">d</xref>
</td>
<td align="center" valign="top" char="." charoff="50">74.49</td>
<td align="center" valign="top" char="." charoff="50">84.75</td>
<td align="center" valign="top" char="." charoff="50">82.31</td>
<td align="center" valign="top" char="." charoff="50">0.597</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="t1-fn1">
<p>
<sup>a</sup>
The parameters used:
<italic>k</italic>
 = 13 and
<italic>m</italic>
 = 7.</p>
</fn>
<fn id="t1-fn2">
<p>
<sup>b</sup>
From Chen
<italic>et al.</italic>
<xref ref-type="bibr" rid="b53">53</xref>
.</p>
</fn>
<fn id="t1-fn3">
<p>
<sup>c</sup>
From Liu
<italic>et al.</italic>
<xref ref-type="bibr" rid="b17">17</xref>
.</p>
</fn>
<fn id="t1-fn4">
<p>
<sup>d</sup>
The parameter used:
<italic>k</italic>
 = 10.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="t2">
<label>Table 2</label>
<caption>
<title>Comparison of the most discriminative gapped k-mer with two known motifs in hotspot sequences.</title>
</caption>
<table frame="hsides" rules="groups" border="1">
<colgroup>
<col align="left"></col>
<col align="center"></col>
<col align="center"></col>
</colgroup>
<thead valign="bottom">
<tr>
<th align="left" valign="top" charoff="50">Motif name
<xref ref-type="fn" rid="t2-fn1">a</xref>
</th>
<th align="center" valign="top" charoff="50">Sequence</th>
<th align="center" valign="top" charoff="50">Matching bases</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left" valign="top" charoff="50">M26</td>
<td align="center" valign="top" charoff="50">A
<bold>TGACGTCAT</bold>
</td>
<td align="center" valign="top" charoff="50">CCG
<bold>*T**C**CA*</bold>
</td>
</tr>
<tr>
<td align="left" valign="top" charoff="50">4095</td>
<td align="center" valign="top" charoff="50">
<bold>GGTCTRGAC</bold>
</td>
<td align="center" valign="top" charoff="50">CC
<bold>G*T**C**C</bold>
A*</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="t2-fn1">
<p>
<sup>a</sup>
These two hotspot motifs were reported in ref.
<xref ref-type="bibr" rid="b57">57</xref>
. The gapped k-mer ‘CCG*T**C**CA*’, which has the highest discriminative power, matches both motifs. The matching bases are shown in bold.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</floats-group>
</pmc>
</record>
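
The k-mer tree described in the Figure 1 caption of the record above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the nested-dictionary layout and the names `build_kmer_tree` and `kmer_counts` are assumptions made here. It uses the three sequences and k = 3 from the caption, and the per-sequence counts it prints for 'AAA' and 'AAT' agree with those stated in the caption.

```python
def build_kmer_tree(sequences, k):
    """Build a nested-dict tree; each depth-k node stores per-sequence counts.

    Hypothetical helper for illustration: a node is a dict keyed by base,
    and a depth-k node also holds a "counts" dict (sequence index -> count).
    """
    root = {}
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            node = root
            # Follow (or create) the path root -> ... -> node for this k-mer.
            for base in seq[start:start + k]:
                node = node.setdefault(base, {})
            counts = node.setdefault("counts", {})
            counts[idx] = counts.get(idx, 0) + 1
    return root


def kmer_counts(tree, kmer, n_sequences):
    """Return how often `kmer` occurs in each sequence, in input order."""
    node = tree
    for base in kmer:
        node = node.get(base)
        if node is None:  # path absent: the k-mer occurs in no sequence
            return [0] * n_sequences
    counts = node.get("counts", {})
    return [counts.get(i, 0) for i in range(n_sequences)]


# Sequences S1, S2, S3 and k = 3, as in the Figure 1 caption.
sequences = ["AAAAT", "ATTTT", "AATA"]
tree = build_kmer_tree(sequences, k=3)
print(kmer_counts(tree, "AAA", len(sequences)))  # [2, 0, 0]
print(kmer_counts(tree, "AAT", len(sequences)))  # [1, 0, 1]
```

Storing the counts at the depth-k nodes means that k-mers sharing a prefix, such as 'AAA' and 'AAT', share the first two steps of their path and diverge only at the last base.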

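Table 2 of the record writes a gapped k-mer with '*' at the gap positions. As a rough sketch of how such a pattern could be counted in a sequence, assuming that '*' matches any single base and that occurrences are counted over all sliding windows, the following Python may help. The function names are hypothetical, and the toy sequences are taken from the Figure 1 caption rather than from the benchmark data.

```python
def matches_gapped(pattern, window, gap="*"):
    """True if `window` agrees with `pattern` at every non-gap position."""
    return len(pattern) == len(window) and all(
        p == gap or p == w for p, w in zip(pattern, window)
    )


def count_gapped(pattern, seq, gap="*"):
    """Count sliding windows of `seq` that match the gapped k-mer `pattern`."""
    n = len(pattern)
    return sum(
        matches_gapped(pattern, seq[i:i + n], gap)
        for i in range(len(seq) - n + 1)
    )


# Toy sequences from the Figure 1 caption (assumed here for illustration).
print(count_gapped("A*A", "AAAAT"))  # 2: windows AAA, AAA match; AAT does not
print(count_gapped("AA*", "AAAAT"))  # 3: AAA, AAA and AAT all match
print(count_gapped("A*T", "AATA"))   # 1: AAT matches; ATA does not
```

Because a gap position accepts any base, one gapped pattern pools the counts of several ungapped k-mers, which is the intuition behind using gaps to reduce sparsity at large k.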
To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000144 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000144 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:4814916
   |texte=   Recombination spot identification Based on gapped k-mers
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:27030570" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021