
Selecting Accurate Classifier Models for a MERS-CoV Dataset

Internal identifier: 000195 (Pmc/Corpus); previous: 000194; next: 000196

Authors: Afnan AlMoammar, Lubna AlHenaki, Heba Kurdi

Source:

Intelligent Systems and Applications ; 2018.

RBID : PMC:7123473

Abstract

Middle East Respiratory Syndrome (MERS), caused by the MERS coronavirus (MERS-CoV), is a viral respiratory disease that is spreading worldwide, which calls for a diagnosis system that accurately predicts infections. Data mining classifiers can greatly assist in enhancing the prediction accuracy of diseases in general. In this paper, classifier model performance for two classification types, (1) binary and (2) multi-class, was tested on a MERS-CoV dataset that consists of all reported cases in Saudi Arabia between 2013 and 2017. A cross-validation model was applied to measure the accuracy of the Support Vector Machine (SVM), Decision Tree, and k-Nearest Neighbor (k-NN) classifiers. Experimental results demonstrate that the SVM and Decision Tree classifiers achieved the highest accuracy, 86.44%, for binary classification based on the healthcare personnel class. For multi-class classification based on the city class, the Decision Tree classifier had the highest accuracy of the three classifiers, although at 42.80% it did not reach a satisfactory level. This work is intended to be part of a MERS-CoV prediction system to enhance the diagnosis of MERS-CoV disease.

Url:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123473

DOI: 10.1007/978-3-030-01054-6_74
PubMed: NONE
PubMed Central: 7123473

Links to Exploration step

PMC:7123473

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Selecting Accurate Classifier Models for a MERS-CoV Dataset</title>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmc">7123473</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7123473</idno>
<idno type="RBID">PMC:7123473</idno>
<idno type="doi">10.1007/978-3-030-01054-6_74</idno>
<idno type="pmid">NONE</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000195</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000195</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Selecting Accurate Classifier Models for a MERS-CoV Dataset</title>
</analytic>
<series><title level="j">Intelligent Systems and Applications</title>
<imprint><date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p id="Par1">Middle East Respiratory Syndrome (MERS), caused by the MERS coronavirus (MERS-CoV), is a viral respiratory disease that is spreading worldwide, which calls for a diagnosis system that accurately predicts infections. Data mining classifiers can greatly assist in enhancing the prediction accuracy of diseases in general. In this paper, classifier model performance for two classification types, (1) binary and (2) multi-class, was tested on a MERS-CoV dataset that consists of all reported cases in Saudi Arabia between 2013 and 2017. A cross-validation model was applied to measure the accuracy of the Support Vector Machine (SVM), Decision Tree, and k-Nearest Neighbor (k-NN) classifiers. Experimental results demonstrate that the SVM and Decision Tree classifiers achieved the highest accuracy, 86.44%, for binary classification based on the healthcare personnel class. For multi-class classification based on the city class, the Decision Tree classifier had the highest accuracy of the three classifiers, although at 42.80% it did not reach a satisfactory level. This work is intended to be part of a MERS-CoV prediction system to enhance the diagnosis of MERS-CoV disease.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Koh, Hc" uniqKey="Koh H">HC Koh</name>
</author>
<author><name sortKey="Tan, G" uniqKey="Tan G">G Tan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Yoo" uniqKey="Yoo">Yoo</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Al Turaiki, Isra" uniqKey="Al Turaiki I">Isra Al-Turaiki</name>
</author>
<author><name sortKey="Alshahrani, Mona" uniqKey="Alshahrani M">Mona Alshahrani</name>
</author>
<author><name sortKey="Almutairi, Tahani" uniqKey="Almutairi T">Tahani Almutairi</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Asri, H" uniqKey="Asri H">H Asri</name>
</author>
<author><name sortKey="Mousannif, H" uniqKey="Mousannif H">H Mousannif</name>
</author>
<author><name sortKey="Moatassime, Ha" uniqKey="Moatassime H">HA Moatassime</name>
</author>
<author><name sortKey="Noel, T" uniqKey="Noel T">T Noel</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author><name sortKey="Zhao, Z" uniqKey="Zhao Z">Z Zhao</name>
</author>
<author><name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author><name sortKey="Cheng, Z" uniqKey="Cheng Z">Z Cheng</name>
</author>
<author><name sortKey="Wang, X" uniqKey="Wang X">X Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Sandhu, R" uniqKey="Sandhu R">R Sandhu</name>
</author>
<author><name sortKey="Sood, Sk" uniqKey="Sood S">SK Sood</name>
</author>
<author><name sortKey="Kaur, G" uniqKey="Kaur G">G Kaur</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jang, Seongpil" uniqKey="Jang S">Seongpil Jang</name>
</author>
<author><name sortKey="Lee, Seunghwan" uniqKey="Lee S">Seunghwan Lee</name>
</author>
<author><name sortKey="Choi, Seong Min" uniqKey="Choi S">Seong-Min Choi</name>
</author>
<author><name sortKey="Seo, Junwon" uniqKey="Seo J">Junwon Seo</name>
</author>
<author><name sortKey="Choi, Hunseok" uniqKey="Choi H">Hunseok Choi</name>
</author>
<author><name sortKey="Yoon, Taeseon" uniqKey="Yoon T">Taeseon Yoon</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Witten, H" uniqKey="Witten H">H Witten</name>
</author>
<author><name sortKey="Frank, E" uniqKey="Frank E">E Frank</name>
</author>
<author><name sortKey="Hall, Ma" uniqKey="Hall M">MA Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Stehman, Sv" uniqKey="Stehman S">SV Stehman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sokolova, M" uniqKey="Sokolova M">M Sokolova</name>
</author>
<author><name sortKey="Lapalme, G" uniqKey="Lapalme G">G Lapalme</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="chapter-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="publisher-id">978-3-030-01054-6</journal-id>
<journal-id journal-id-type="doi">10.1007/978-3-030-01054-6</journal-id>
<journal-id journal-id-type="nlm-ta">Intelligent Systems and Applications</journal-id>
<journal-title-group><journal-title>Intelligent Systems and Applications</journal-title>
<journal-subtitle>Proceedings of the 2018 Intelligent Systems Conference (IntelliSys) Volume 1</journal-subtitle>
</journal-title-group>
<isbn publication-format="print">978-3-030-01053-9</isbn>
<isbn publication-format="electronic">978-3-030-01054-6</isbn>
</journal-meta>
<article-meta><article-id pub-id-type="pmc">7123473</article-id>
<article-id pub-id-type="publisher-id">74</article-id>
<article-id pub-id-type="doi">10.1007/978-3-030-01054-6_74</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>Selecting Accurate Classifier Models for a MERS-CoV Dataset</article-title>
</title-group>
<contrib-group content-type="book editors"><contrib contrib-type="editor"><name><surname>Arai</surname>
<given-names>Kohei</given-names>
</name>
<address><email>arai@is.saga-u.ac.jp</email>
</address>
<xref ref-type="aff" rid="Aff14">14</xref>
</contrib>
<contrib contrib-type="editor"><name><surname>Kapoor</surname>
<given-names>Supriya</given-names>
</name>
<address><email>supriya.kapoor@thesai.org</email>
</address>
<xref ref-type="aff" rid="Aff15">15</xref>
</contrib>
<contrib contrib-type="editor"><name><surname>Bhatia</surname>
<given-names>Rahul</given-names>
</name>
<address><email>rahul.bhatia@thesai.org</email>
</address>
<xref ref-type="aff" rid="Aff16">16</xref>
</contrib>
<aff id="Aff14"><label>14</label>
<institution-wrap><institution-id institution-id-type="GRID">grid.412339.e</institution-id>
<institution-id institution-id-type="ISNI">0000 0001 1172 4459</institution-id>
<institution>Faculty of Science and Engineering,</institution>
<institution>Saga University,</institution>
</institution-wrap>
Saga, Japan</aff>
<aff id="Aff15"><label>15</label>
<institution-wrap><institution-id institution-id-type="GRID">grid.473726.3</institution-id>
<institution>The Science and Information (SAI) Organization,</institution>
</institution-wrap>
Bradford, UK</aff>
<aff id="Aff16"><label>16</label>
<institution-wrap><institution-id institution-id-type="GRID">grid.473726.3</institution-id>
<institution>The Science and Information (SAI) Organization,</institution>
</institution-wrap>
Bradford, UK</aff>
</contrib-group>
<contrib-group><contrib contrib-type="author" corresp="yes"><name><surname>AlMoammar</surname>
<given-names>Afnan</given-names>
</name>
<address><email>437203909@student.ksu.edu.sa</email>
</address>
<xref ref-type="aff" rid="Aff17"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>AlHenaki</surname>
<given-names>Lubna</given-names>
</name>
<address><email>437204268@student.ksu.edu.sa</email>
</address>
<xref ref-type="aff" rid="Aff17"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Kurdi</surname>
<given-names>Heba</given-names>
</name>
<address><email>hkurdi@ksu.edu.sa</email>
</address>
<xref ref-type="aff" rid="Aff17"></xref>
</contrib>
<aff id="Aff17"><institution-wrap><institution-id institution-id-type="GRID">grid.56302.32</institution-id>
<institution-id institution-id-type="ISNI">0000 0004 1773 5396</institution-id>
<institution>Computer Science Department,</institution>
<institution>KSU,</institution>
</institution-wrap>
KSA Riyadh, Saudi Arabia</aff>
</contrib-group>
<pub-date pub-type="epub"><day>09</day>
<month>11</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="collection"><year>2019</year>
</pub-date>
<volume>868</volume>
<fpage>1070</fpage>
<lpage>1084</lpage>
<permissions><copyright-statement>© Springer Nature Switzerland AG 2019</copyright-statement>
<license><license-p>This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.</license-p>
</license>
</permissions>
<abstract id="Abs1"><p id="Par1">Middle East Respiratory Syndrome (MERS), caused by the MERS coronavirus (MERS-CoV), is a viral respiratory disease that is spreading worldwide, which calls for a diagnosis system that accurately predicts infections. Data mining classifiers can greatly assist in enhancing the prediction accuracy of diseases in general. In this paper, classifier model performance for two classification types, (1) binary and (2) multi-class, was tested on a MERS-CoV dataset that consists of all reported cases in Saudi Arabia between 2013 and 2017. A cross-validation model was applied to measure the accuracy of the Support Vector Machine (SVM), Decision Tree, and k-Nearest Neighbor (k-NN) classifiers. Experimental results demonstrate that the SVM and Decision Tree classifiers achieved the highest accuracy, 86.44%, for binary classification based on the healthcare personnel class. For multi-class classification based on the city class, the Decision Tree classifier had the highest accuracy of the three classifiers, although at 42.80% it did not reach a satisfactory level. This work is intended to be part of a MERS-CoV prediction system to enhance the diagnosis of MERS-CoV disease.</p>
</abstract>
<kwd-group xml:lang="en"><title>Keywords</title>
<kwd>Data mining</kwd>
<kwd>Medical data</kwd>
<kwd>Classification</kwd>
<kwd>Classifier model</kwd>
<kwd>MERS-CoV</kwd>
<kwd>Accuracy measurement</kwd>
<kwd>Cross-validation model</kwd>
</kwd-group>
<custom-meta-group><custom-meta><meta-name>issue-copyright-statement</meta-name>
<meta-value>© Springer Nature Switzerland AG 2019</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body><sec id="Sec1"><title>Introduction</title>
<p id="Par2">Middle East respiratory syndrome (MERS) is a viral respiratory disease that has spread to 27 countries around the world. The disease is caused by a novel coronavirus called the Middle East respiratory syndrome coronavirus (MERS-CoV). Coronaviruses are a large family of viruses responsible for many diseases, from mild colds to Severe Acute Respiratory Syndrome (SARS). MERS-CoV is one of the major causes of increased mortality among children and adults in the world [<xref ref-type="bibr" rid="CR1">1</xref>
]. MERS-CoV was first identified in Saudi Arabia in 2012. It spread rapidly in Saudi Arabia and many other countries and caused a large number of deaths [<xref ref-type="bibr" rid="CR2">2</xref>
]. Therefore, early diagnosis of MERS-CoV infection may help to control the outbreak of the virus and reduce human suffering. Computer and data mining techniques can provide great help in analyzing, diagnosing, and predicting diseases, and they can assist in controlling virus infection [<xref ref-type="bibr" rid="CR3">3</xref>
].</p>
<p id="Par3">Using data mining techniques in diagnosis and prediction of diseases has been developing fast over the last few decades. Data mining is the process of analyzing a large amount of complex data to find useful patterns and extract hidden information by applying machine learning algorithms [<xref ref-type="bibr" rid="CR4">4</xref>
]. In healthcare, the generated data is too vast and complex to be analyzed and processed by traditional methods. As a result, data mining is becoming essential in healthcare. It has been widely used there for outcome prediction, treatment effectiveness evaluation, infection control, and disease diagnosis [<xref ref-type="bibr" rid="CR3">3</xref>
]. Moreover, studies on data mining in healthcare show that it helps to improve diagnostic accuracy and to predict health insurance fraud, which lightens the burden of increasing workloads and reduces healthcare costs [<xref ref-type="bibr" rid="CR5">5</xref>
].</p>
<p id="Par4">Recently, various types of data mining methods have been applied by a number of researchers [<xref ref-type="bibr" rid="CR6">6</xref>
, <xref ref-type="bibr" rid="CR7">7</xref>
], using real MERS-CoV datasets based on several types of machine learning classifiers. MERS is a complex disease caused by MERS-CoV that spreads easily and has a high death rate; approximately 40% of patients diagnosed with MERS have died [<xref ref-type="bibr" rid="CR1">1</xref>
]. The challenge remains to provide prediction systems that accurately anticipate and diagnose MERS-CoV infection. Such systems are driven primarily by the need for the highest possible accuracy. Our motivation for this study is to use data mining techniques to help control the spread of MERS-CoV and to save lives. To that end, we apply classification algorithms to a MERS-CoV dataset to identify the most accurate classifier.</p>
<p id="Par5">The main contribution of this study is to apply a support vector machine classifier alongside two other classifiers and assess their classification accuracy on a MERS-CoV dataset. Whereas previous studies used datasets with information about MERS-CoV cases only up to 2016 at the latest, our dataset covers all affected cases in Saudi Arabia from 2013 to 2017.</p>
<p id="Par6">The remaining parts of this paper are organized as follows: The literature review is introduced in Section <xref rid="Sec2" ref-type="sec">2</xref>
. Then, the system design and implementation are presented in Section <xref rid="Sec3" ref-type="sec">3</xref>
. The methodology is then described in Section <xref rid="Sec4" ref-type="sec">4</xref>
. After that, the results and discussion are detailed in Section <xref rid="Sec7" ref-type="sec">5</xref>
. Finally, the conclusions and directions for future work are discussed.</p>
</sec>
<sec id="Sec2"><title>Literature Review</title>
<p id="Par7">One of the early applications of data mining techniques was in medicine, where it could help in predicting and diagnosing diseases and support medical decision making. Several researchers have worked on data mining applications and on experiments with medical datasets. This review covers some of the related work in healthcare, but it is not meant to be exhaustive. The first part introduces some applications of classification algorithms to different medical datasets. The second part reviews related work on MERS-CoV diagnosis and prediction using data mining techniques.</p>
<p id="Par8">For instance, the researchers in [<xref ref-type="bibr" rid="CR8">8</xref>
] apply data mining to historical health records to improve the prediction of chronic disease. In this study, two datasets from the UC Irvine (UCI) repository are considered: heart disease and diabetes. Several data mining algorithms are applied, including Naïve Bayes, Decision Tree, Support Vector Machine (SVM), and Artificial Neural Networks (ANN). In the experiment, SVM performs better than the other classifiers on the heart disease dataset, while the Naïve Bayes classifier achieves the highest accuracy on the diabetes dataset.</p>
<p id="Par9">A recent study [<xref ref-type="bibr" rid="CR9">9</xref>
] uses data mining to improve the diagnosis of neonatal jaundice in newborns. The dataset consists of records of healthy newborn infants with 35 or more weeks of gestation collected from the Obstetrics Department of the Centro Hospital. Several data mining algorithms are applied to the dataset: Decision Tree, CART, Naïve Bayes, Artificial Neural Networks, SVM, and Easy Logistic algorithms. The results of this study show that the most effective predictive models are the Naïve Bayes, Neural Networks, and Easy Logistic algorithms.</p>
<p id="Par10">The researchers in [<xref ref-type="bibr" rid="CR10">10</xref>
] compare different data mining algorithms to find the most efficient and effective algorithm in terms of accuracy, sensitivity, and precision. An experiment is conducted using the Wisconsin Breast Cancer (original) dataset from the UCI machine learning repository with four classifiers: SVM, Naïve Bayes, Decision Tree, and k-Nearest Neighbor (k-NN). The effectiveness of each classifier is evaluated in terms of time to build the model, correctly classified instances, incorrectly classified instances, and accuracy. The results show that SVM is the most efficient classifier for breast cancer prediction and diagnosis, with high precision and a low error rate.</p>
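<p>The evaluation measures named in these studies (accuracy, precision, and sensitivity) all derive from the counts of a binary confusion matrix. The following is a minimal Python sketch, not code from any of the cited works; the function name <monospace>binary_metrics</monospace> and the example counts are hypothetical and chosen only for illustration.</p>
<preformat>
```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common measures from binary confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
    }


# Hypothetical counts for a 100-case test set (illustration only).
example = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print(example)
```
</preformat>
<p>Reporting precision and sensitivity next to accuracy matters when classes are imbalanced, because a classifier can reach high accuracy by favoring the majority class.</p>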
<p id="Par11">Another study [<xref ref-type="bibr" rid="CR11">11</xref>
] applies different machine learning algorithms to artificial lung cancer datasets systematically collected by the Hospital Information System in order to explore the advantages and disadvantages of each algorithm. Many experiments are conducted on the dataset using the following machine learning algorithms: Decision Tree, Bagging, Adaboost, SVM, k-NN, and Neural Network. The results show that, given their high accuracy, Adaboost and Neural Network are suitable for this type of cancer analysis.</p>
<p id="Par12">The researchers in [<xref ref-type="bibr" rid="CR12">12</xref>
] compare two classification algorithms, Decision Trees and Random Forest, with Self-Organizing Map (SOM) to build a predictive model for diabetic patients. The dataset used in this study was collected from the Hospital Information System of the Ministry of National Guard Health Affairs (MNGHA), Saudi Arabia, between 2013 and 2015. The authors found that the Random Forest algorithm achieves the highest recall and precision.</p>
<p id="Par13">The authors in [<xref ref-type="bibr" rid="CR13">13</xref>
] introduce MobDBTest, an Android mobile application that uses machine learning techniques to predict diabetes levels for its users. The proposed application is tested on a real dataset collected from a reputed hospital in the Chhattisgarh state of India. Four machine learning algorithms, namely J48, Naïve Bayes, SVM, and Multilayer Perceptron, are used to classify the collected data. The results show that the J48 algorithm outperformed the other methods in terms of sensitivity, specificity, and ROC area.</p>
<p id="Par14">During the past six years, more information about the MERS-CoV disease has become available to the public. MERS-CoV is a well-known virus that is still spreading rapidly. Finding an accurate classifier can help to improve the prediction accuracy of MERS-CoV infection. The study in [<xref ref-type="bibr" rid="CR7">7</xref>
] applies data mining techniques to a MERS-CoV dataset to identify accurate classifier models for binary, multi-class, and multi-label classification. The dataset includes all MERS-CoV cases in Saudi Arabia reported by the Saudi Ministry of Health from 2013 to the second half of 2016. Three classifier models are built using the k-NN, Decision Tree, and Naïve Bayes algorithms. The outcome of this research is that Decision Tree is the most accurate algorithm for binary classification, whereas k-NN is the most accurate for multi-class classification and Naïve Bayes is the most accurate for multi-label classification.</p>
<p id="Par15">Another related study [<xref ref-type="bibr" rid="CR6">6</xref>
] involves experimental data mining to build prediction models for MERS-CoV. The experiments are conducted on a dataset collected from the Saudi Ministry of Health that consists of MERS-CoV cases between 2013 and 2015. The Naïve Bayes and Decision Tree algorithms are used to develop recovery and stability predictive models based on the MERS-CoV dataset. The results of the recovery models indicate that healthcare workers are more likely to survive. Moreover, symptoms and age are important attributes for predicting stability in the stability models. In general, Decision Tree has better accuracy across all models.</p>
<p id="Par16">The researchers in [<xref ref-type="bibr" rid="CR14">14</xref>
] propose a molecular approach that analyzes DNA sequences of MERS-CoV to trace the route of transmission of MERS-CoV from Saudi Arabia to the world. Full DNA sequences collected from 15 different regions from the National Center for Biotechnology Information (NCBI) are converted into amino acid sequences to be used in the analysis process. The proposed approach uses the Apriori and Decision Tree algorithms to find the similarities and differences between amino acid sequences. Relevance between several sequences is found using the Decision Tree algorithm.</p>
<p id="Par17">The study described in [<xref ref-type="bibr" rid="CR15">15</xref>
] proposes a cloud-based MERS-CoV prediction system to predict and prevent the spread of MERS-CoV infection among citizens and regions. The dataset consists of patients, medicines, and reports of each user. It is stored in multiple clouds known as a medical record (M.R.) database. The system is based on a statistical classifier from data mining, a Bayesian classification algorithm, which performs the initial classification of the patient based on predicted class membership probabilities. The outcome of this study is a prediction of MERS-CoV-infected regions on Google Maps with high classification accuracy.</p>
<p id="Par18">A study [<xref ref-type="bibr" rid="CR16">16</xref>
] applies three data mining algorithms to compare two viruses with similar symptoms: the severe acute respiratory syndrome (SARS) and MERS coronaviruses. The Apriori, Decision Tree, and SVM data mining algorithms are used on spike glycoprotein data from the NCBI to distinguish between the two viruses. The experiment shows that the applied approach distinguishes between MERS and SARS spike glycoproteins with high accuracy.</p>
<p id="Par19">Table <xref rid="Tab1" ref-type="table">1</xref>
 presents a comparison of the reviewed studies that applied data mining techniques to medical data. The comparison categories are the reference number, the data mining algorithms used, the dataset, the objective of the research, the tool, and the outcome of the research. From Table <xref rid="Tab1" ref-type="table">1</xref>
, it can be seen that several algorithms and techniques have been applied to medical datasets and that the most common methods for classification are the Decision Tree, SVM, and k-NN algorithms.<table-wrap id="Tab1"><label>Table 1.</label>
<caption><p>Comparison of relevant literature</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Ref. no.</th>
<th align="left">Data mining techniques</th>
<th align="left">Dataset</th>
<th align="left">Objective</th>
<th align="left">Tool</th>
<th align="left">Outcomes</th>
</tr>
</thead>
<tbody><tr><td align="left">[<xref ref-type="bibr" rid="CR8">8</xref>
]</td>
<td align="left">Naïve Bayes, Decision Tree, SVM, and ANN</td>
<td align="left">Heart disease, and diabetes datasets</td>
<td align="left">Predicting chronic disease by mining data containing historical health records</td>
<td align="left">WEKA</td>
<td align="left">SVM gives the highest accuracy (95.55%) on the heart disease dataset, and the Naïve Bayes classifier gives the highest accuracy (73.58%) on the diabetes dataset</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR9">9</xref>
]</td>
<td align="left">Decision Tree, CART, Naïve Bayes classifier, neural networks, SMO, and easy logistic</td>
<td align="left">Records of healthy newborn infants with 35 or more weeks of gestation</td>
<td align="left">Improving the diagnosis of neonatal jaundice in newborns</td>
<td align="left">WEKA</td>
<td align="left">The most effective predictive models are Naïve Bayes with 88% accuracy, neural networks with 87% accuracy, and easy logistic with 89% accuracy</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR10">10</xref>
]</td>
<td align="left">SVM, C4.5, Naive Bayes, and <italic>k</italic>
-NN</td>
<td align="left">Wisconsin Breast Cancer (original) dataset</td>
<td align="left">Finding the most efficient algorithm for Breast Cancer prediction and diagnosis</td>
<td align="left">WEKA</td>
<td align="left">The SVM has proven its efficiency in Breast Cancer prediction and diagnosis with 97.13% accuracy</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR11">11</xref>
]</td>
<td align="left">Decision Tree, Bagging, Adaboost, SVM, <italic>k</italic>
-NN, and Neural Network</td>
<td align="left">Artificial lung cancer dataset</td>
<td align="left">Comparing different classification algorithms in order to explore the advantages and disadvantages of each one</td>
<td align="left">RStudio</td>
<td align="left">The Adaboost and neural network algorithms have relatively high accuracy (97.5%)</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR12">12</xref>
]</td>
<td align="left">Self-Organizing Map (SOM), Decision Trees C4.5, and Random Forest</td>
<td align="left">Adult population data</td>
<td align="left">Constructing an intelligent predictive model for diabetic disease by using real healthcare data</td>
<td align="left">RStudio and WEKA</td>
<td align="left">The RandomForest model could assist health care providers with 90% accuracy to make better clinical decisions in identifying diabetic patients</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR13">13</xref>
]</td>
<td align="left">J48, Naïve Bayes, SVM and Multilayer Perceptron</td>
<td align="left">Real dataset collected from a reputed hospital in the Chhattisgarh state of India</td>
<td align="left">Predicting diabetes levels for users with machine learning techniques</td>
<td align="left">Android mobile application</td>
<td align="left">The results show that J48 algorithm outperformed other methods in terms of sensitivity, specificity and ROC areas</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR7">7</xref>
]</td>
<td align="left"><italic>k</italic>
-NN, Decision Tree, and Naïve Bayes algorithms</td>
<td align="left">MERS-CoV cases in Saudi Arabia noted between 2013 and second half of 2016</td>
<td align="left">Identifying accurate classifier models for binary, multiclass, and multi-label classification of a text-based MERS-CoV dataset</td>
<td align="left">RapidMiner Studio</td>
<td align="left">The most accurate algorithm is Decision Tree for binary classification (90% accuracy), <italic>k</italic>
-NN for multi-class classification (51.60% accuracy), and Naïve Bayes for multi-label classification (77% accuracy)</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR6">6</xref>
]</td>
<td align="left">Naive Bayes, and Decision Tree algorithms</td>
<td align="left">1082 records of MERS-CoV cases noted between 2013 and 2015</td>
<td align="left">Building predictive models for MERS-CoV infection to understand which factors contribute to complications of this infection</td>
<td align="left">WEKA</td>
<td align="left">The Decision Tree classifier has the better accuracy: 55.69% for the stability models and 68% for the recovery models</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR14">14</xref>
]</td>
<td align="left">Decision Tree, and Apriori Algorithms</td>
<td align="left">DNA sequences of the MERS-CoV outbreak from different regions of the world where the virus spread</td>
<td align="left">Finding the similarities between different MERS-CoV amino acid sequences to know transmission route of MERS-CoV</td>
<td align="left">Mathematical model</td>
<td align="left">The results show that Riyadh, Makkah, and Buridah regions of MERS-CoV transmission in Saudi Arabia</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR15">15</xref>
]</td>
<td align="left">BBN classification</td>
<td align="left">Multiple attributes: 1-personal (static), 2-MERS (changes over time)</td>
<td align="left">Identifying an intelligent system for predicting and preventing MERS-CoV infection</td>
<td align="left">R Studio, WEKA, and Amazon EC2</td>
<td align="left">The BBN achieves an accuracy of 83.1% on synthetic data</td>
</tr>
<tr><td align="left">[<xref ref-type="bibr" rid="CR16">16</xref>
]</td>
<td align="left">Decision Tree, and Apriori Algorithms</td>
<td align="left">DNA sequences of MERS-CoV of outbreak</td>
<td align="left">Finding the similarities between different MERS-CoV amino acid sequences to know transmission route of MERS-CoV</td>
<td align="left">Mathematical model</td>
<td align="left">The results point to the Riyadh, Makkah, and Buridah regions as regions of MERS-CoV transmission in Saudi Arabia</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par20">In conclusion, in related studies data mining is widely used for the prognoses and diagnoses of many diseases. However, the datasets used in [<xref ref-type="bibr" rid="CR6">6</xref>
, <xref ref-type="bibr" rid="CR7">7</xref>
] are limited and include the MERS-CoV cases in Saudi Arabia from 2013–2015 only. It is important to increase the size of the dataset to cover new cases. Therefore, this study applied data mining techniques, using the Decision Tree, SVM, and k-NN classification algorithms, to a real dataset of MERS-CoV cases in the Kingdom of Saudi Arabia that was collected during 2013–2017.</p>
</sec>
<sec id="Sec3"><title>System Design and Implementation</title>
<p id="Par21">The system overview is illustrated in Fig. <xref rid="Fig1" ref-type="fig">1</xref>
. It shows high-level components of the classification framework. The classification framework is composed of three subsystems, which are the MERS-CoV dataset, supervised learning, and data scientist. The MERS-CoV dataset subsystem aims to collect MERS-CoV data from different sources and integrate them into one database. The purpose of the supervised learning subsystem, which is the core of this study, is applying data mining techniques to build three different classifier models. Finally, the third subsystem consists of data scientists who analyze data and evaluate results.<fig id="Fig1"><label>Fig. 1.</label>
<caption><p>System overview.</p>
</caption>
<graphic xlink:href="473257_1_En_74_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p id="Par22">Figure <xref rid="Fig2" ref-type="fig">2</xref>
 shows the overall workflow of the classification framework, which is divided into two main phases. The first phase aims to collect data of patients who are affected by MERS-CoV from different cities in Saudi Arabia between January 2013 and October 2017. The second phase is the most important phase. Its purpose is to identify the classifier model and evaluate the classification accuracy using cross validation test mode.<fig id="Fig2"><label>Fig. 2.</label>
<caption><p>System workflow.</p>
</caption>
<graphic xlink:href="473257_1_En_74_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
</sec>
<sec id="Sec4"><title>Methodology</title>
<sec id="Sec5"><title>Dataset Description and Pre-processing</title>
<p id="Par23">As mentioned, the dataset used in this study covers all MERS-CoV cases in Saudi Arabia, including 1,186 alive records and 224 death records, which were reported between 2013 and 2017. The dataset of MERS-CoV cases from 2013–2015 was obtained by a request from [<xref ref-type="bibr" rid="CR2">2</xref>
], while the 2016–2017 dataset was collected from the website of the World Health Organization [<xref ref-type="bibr" rid="CR2">2</xref>
]. The MERS-CoV dataset consists of the following information about MERS-CoV patients: gender, age, exposure to camels, comorbidities, exposure to MERS-CoV cases, city, and whether the patient is employed in healthcare. In addition, the dataset contains a status attribute indicating whether the patient is alive or dead.</p>
<p id="Par24">The challenge in building the dataset was that the data were published on the website as text descriptions of the MERS-CoV cases, as represented in Fig. <xref rid="Fig3" ref-type="fig">3</xref>
, which were not directly usable by any data mining tool. This compelled us to construct the dataset from scratch. All records were then prepared in Comma Separated Value (CSV) format, which is appropriate for a data mining tool.<fig id="Fig3"><label>Fig. 3.</label>
<caption><p>A sample of text description of MERS-CoV cases.</p>
</caption>
<graphic xlink:href="473257_1_En_74_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
<p id="Par25">To enhance the quality of the classification framework, different preprocessing techniques were applied to the MERS-CoV dataset, including replacing missing values and reducing noise. Each missing value was replaced with the mean of the attribute in which it occurs. The noise in the dataset was due to inconsistent data. For instance, the gender attribute is represented in some instances using the full word “female” or “male,” while in other instances it is represented using the abbreviations “F” or “M.” The inconsistent values were therefore unified into the standard values “F” and “M.” Furthermore, the data were converted from categorical to numerical data because the SVM algorithm deals only with numerical data.</p>
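<p>The preprocessing steps described above can be sketched in Python. This is only an illustration under stated assumptions, not the authors' RapidMiner process: the field names, the sample records, and the 0/1 gender encoding are hypothetical.</p>
<preformat><![CDATA[
```python
from statistics import mean

# Hypothetical raw records; field names and values are illustrative only.
records = [
    {"gender": "female", "age": 45.0},
    {"gender": "M", "age": None},      # missing age
    {"gender": "male", "age": 61.0},
    {"gender": "F", "age": 32.0},
]

# Map every spelling of gender to one standard value.
GENDER_MAP = {"female": "F", "f": "F", "male": "M", "m": "M"}
# Assumed numeric encoding, needed because SVM works on numbers only.
GENDER_CODE = {"F": 0, "M": 1}


def preprocess(rows):
    """Unify gender values, impute missing ages with the mean, encode numerically."""
    known_ages = [r["age"] for r in rows if r["age"] is not None]
    age_mean = mean(known_ages)
    cleaned = []
    for r in rows:
        gender = GENDER_MAP[r["gender"].strip().lower()]
        age = r["age"] if r["age"] is not None else age_mean
        cleaned.append({"gender": GENDER_CODE[gender], "age": age})
    return cleaned


cleaned = preprocess(records)
```
]]></preformat>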
</sec>
<sec id="Sec6"><title>Data Mining</title>
<p id="Par26">In a similar approach to [<xref ref-type="bibr" rid="CR7">7</xref>
], in this study, the classifier model performance was examined for two classification types: binary classification based on the healthcare personnel class, and multi-class classification based on the city class. The SVM, Decision Tree, and k-NN classification algorithms were chosen because they are commonly used for medical mining as presented in the Literature Review section of this paper. In addition, they outperform other applied algorithms, as shown in [<xref ref-type="bibr" rid="CR6">6</xref>
, <xref ref-type="bibr" rid="CR12">12</xref>
, <xref ref-type="bibr" rid="CR13">13</xref>
]. Furthermore, SVM was used in this study because it had not been applied to a MERS-CoV dataset in recent studies [<xref ref-type="bibr" rid="CR6">6</xref>
, <xref ref-type="bibr" rid="CR7">7</xref>
]. Also, the Decision Tree and k-NN classifiers in [<xref ref-type="bibr" rid="CR7">7</xref>
] achieved the highest accuracy on binary-class and multi-class classification respectively. Additionally, based on the structure of the dataset, each class was chosen to represent a different classification type.</p>
<p id="Par27">The software used in this study was RapidMiner Studio version 7.6. RapidMiner is an open-source data mining software tool written in the Java programming language. It is issued under the Affero General Public License and provides an integrated environment for data mining and predictive analytics. Moreover, RapidMiner is used to run machine learning algorithms for data mining tasks [<xref ref-type="bibr" rid="CR17">17</xref>
]. The SVM, Decision Tree, and k-NN algorithms were applied to the MERS-CoV dataset using RapidMiner. In sum, the experiment was run six times, once per classifier for each of the two classification types.</p>
<p id="Par28">The essential parameter of the k-NN algorithm is k, the number of nearest neighbors. The value of k was set to 5 because an odd value is recommended to prevent a tie, in which two or more classes have the same number of votes. Additionally, the Euclidean distance function was used as a similarity measure between testing and training data [<xref ref-type="bibr" rid="CR4">4</xref>
, <xref ref-type="bibr" rid="CR18">18</xref>
]:<disp-formula id="Equ1"><label>1</label>
<alternatives><tex-math id="M1">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ Euclidean\,distance\, (x,x_{i} ) = \sqrt {\mathop \sum \nolimits_{i = 0}^{m} (x - x_{i} )^{2} } $$\end{document}</tex-math>
<graphic xlink:href="473257_1_En_74_Chapter_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par29">Where <inline-formula id="IEq1"><alternatives><tex-math id="M2">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ x $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is the testing point, and <inline-formula id="IEq2"><alternatives><tex-math id="M3">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ x_{i} $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is the training point.</p>
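<p>As a rough illustration of the k-NN rule with Euclidean distance and k = 5, the following Python sketch classifies a query point by majority vote among its five nearest training points. The function names, the two-dimensional points, and the labels are hypothetical and are not taken from the MERS-CoV dataset.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train_points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5)]
train_labels = ["no", "no", "no", "no", "yes", "yes", "yes"]

prediction = knn_predict(train_points, train_labels, (0.5, 0.5), k=5)
```
]]></preformat>
<p>With an odd k and two classes, the vote in this sketch cannot end in a tie, which is the reason given above for choosing k = 5.</p>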
<p id="Par30">For the decision tree, gain ratio was used as the attribute selection method for splitting, because it measures the information gained by each attribute easily and quickly. The gain ratio builds on the information gain, which is calculated using the following formula [<xref ref-type="bibr" rid="CR4">4</xref>
, <xref ref-type="bibr" rid="CR18">18</xref>
]:<disp-formula id="Equ2"><label>2</label>
<alternatives><tex-math id="M4">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ Gain(A) = Info(D) - Info_{A} (D) $$\end{document}</tex-math>
<graphic xlink:href="473257_1_En_74_Chapter_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where Info(D) is the average amount of information needed to identify the class label of a tuple in D, also known as the entropy of D. It is calculated by:<disp-formula id="Equ3"><label>3</label>
<alternatives><tex-math id="M5">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ Info\left( D \right) = - \sum\nolimits_{i = 1}^{m} {{{\rm P}}_{{\rm i}}   {{\rm log}}_{2}  ({{\rm P}}_{{\rm i}} )} $$\end{document}</tex-math>
<graphic xlink:href="473257_1_En_74_Chapter_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par32">Where <inline-formula id="IEq3"><alternatives><tex-math id="M6">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ {\text{P}}_{\text{i}} $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is the probability that an arbitrary tuple in D belongs to class C<sub>i</sub>, and <inline-formula id="IEq4"><alternatives><tex-math id="M7">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ Info_{A} \left( D \right) $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is the expected information required to classify a tuple from D based on the partitioning by attribute A. It is calculated by:<disp-formula id="Equ4"><label>4</label>
<alternatives><tex-math id="M8">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ Info_{A}  (D) = \sum\nolimits_{{{\text{j}} = 1}}^{\text{v}} {\frac{{\left| {{\text{D}}_{\text{j}} } \right|}}{{\left| {\text{D}} \right|}} \times {\text{Info}}\left( {{\text{D}}_{\text{j}} } \right)} $$\end{document}</tex-math>
<graphic xlink:href="473257_1_En_74_Chapter_Equ4.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par33">The term <inline-formula id="IEq5"><alternatives><tex-math id="M9">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ \frac{{\left| {{\text{D}}_{\text{j}} } \right|}}{{\left| {\text{D}} \right|}} $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is the weight of the j<sup>th</sup>
 partition [<xref ref-type="bibr" rid="CR4">4</xref>
].</p>
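<p>The quantities in Eqs. 2 to 4 can be computed as in the following Python sketch. The function names and the four-record example are hypothetical. The final function divides the gain by the split information, which is the usual definition of gain ratio; it is included as an assumption, since this section gives formulas only for the gain itself.</p>
<preformat><![CDATA[
```python
import math
from collections import Counter


def info(labels):
    """Entropy Info(D) in bits: -sum(P_i * log2(P_i)) over the class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def info_after_split(values, labels):
    """Info_A(D): entropy of each partition D_j weighted by |D_j| / |D|."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    return sum(len(part) / total * info(part) for part in partitions.values())


def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(values, labels)


def gain_ratio(values, labels):
    """Gain divided by split information (assumed standard definition)."""
    split_info = info(values)
    return gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical example: attribute "a"/"b" separates the classes perfectly.
labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b"]
```
]]></preformat>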
<p id="Par34">The maximum depth of the Decision Tree was set to 20. Also, the Decision Tree was generated with a pruning function, which allows for reducing the size of the tree by removing low-power sub-trees.</p>
<p id="Par35">The essential parameter of the SVM classifier is the kernel function. The kernel function most commonly used with the SVM classifier is the linear kernel function [<xref ref-type="bibr" rid="CR6">6</xref>
], which is defined as:<disp-formula id="Equ5"><label>5</label>
<alternatives><tex-math id="M10">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ F(x) = W \cdot ( x_{i} , y_{i} ) + b $$\end{document}</tex-math>
<graphic xlink:href="473257_1_En_74_Chapter_Equ5.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par36">Where <inline-formula id="IEq6"><alternatives><tex-math id="M11">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ X = \left( { x_{i} , y_{i} } \right) $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is the dataset, <inline-formula id="IEq7"><alternatives><tex-math id="M12">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ x_{i} $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is an instance, <inline-formula id="IEq8"><alternatives><tex-math id="M13">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ y_{i} $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is its class label, with <inline-formula id="IEq9"><alternatives><tex-math id="M14">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ i = 1, 2, \ldots , n $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
 and W is the weight vector or the coefficient, and <inline-formula id="IEq10"><alternatives><tex-math id="M15">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ b $$\end{document}</tex-math>
<inline-graphic xlink:href="473257_1_En_74_Chapter_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
 is a scalar value called bias [<xref ref-type="bibr" rid="CR4">4</xref>
, <xref ref-type="bibr" rid="CR17">17</xref>
]. Other important parameters are the complexity constant (C) and the tolerance parameter. In this study, the linear kernel function was used because the data are linearly separable, the parameter C was set to 0, and the tolerance parameter was set to 0.001.</p>
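<p>To make the role of the weight vector W and the bias b concrete, the following Python sketch evaluates a linear decision function in its standard form, W · x + b, and classifies by its sign. The weights, bias, and input points are hypothetical values chosen for illustration; they are not parameters fitted to the MERS-CoV dataset.</p>
<preformat><![CDATA[
```python
def decision_value(weights, bias, x):
    """Linear decision function: dot product of weights and x, plus bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


# Hypothetical, hand-picked parameters (not learned from data).
weights = (1.0, -1.0)
bias = -0.5

positive_side = predict(weights, bias, (2.0, 0.5))
negative_side = predict(weights, bias, (0.0, 1.0))
```
]]></preformat>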
<p id="Par37">Once the classification types, classifier algorithms, and their parameters were specified, a method was needed for assessing the classification performance. Therefore, cross-validation was used. In the k-fold cross validation technique, the dataset is randomly split into k equal-sized subsets. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together and used as the training set; the process is repeated k times so that each subset is used for testing once. In this empirical study, all models were built using 10-fold cross validation (train on nine subsets and test on one, repeated ten times). The advantage of this method is that the particular division of the data into training and testing sets becomes irrelevant [<xref ref-type="bibr" rid="CR19">19</xref>
].</p>
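<p>The 10-fold procedure can be sketched in Python as follows. This is an assumption-laden illustration rather than RapidMiner's implementation: the function names, the fixed random seed, the majority-class baseline classifier, and the synthetic labels are all hypothetical.</p>
<preformat><![CDATA[
```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Shuffle record indices and deal them into k folds of near-equal size."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(records, labels, train_and_predict, k=10, seed=0):
    """Return accuracy (%) pooled over k folds; each fold is the test set once."""
    folds = k_fold_indices(len(records), k, seed)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        predictions = train_and_predict(
            [records[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [records[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return 100.0 * correct / len(records)


def majority_classifier(train_x, train_y, test_x):
    """Hypothetical baseline: always predict the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)


# Synthetic example: 80 "no" and 20 "yes" labels.
records = list(range(100))
labels = ["no"] * 80 + ["yes"] * 20
accuracy = cross_validate(records, labels, majority_classifier, k=10)
```
]]></preformat>
<p>Any classifier with the same three-argument interface could be passed in place of the baseline.</p>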
<p id="Par38">Evaluation is where the value of a study can be assessed. To compare the results of the algorithms applied to the MERS-CoV dataset in this project, their performances were quantitatively measured using accuracy, which is the most widely used evaluation metric and reflects the percentage of correctly classified records in the testing phase [<xref ref-type="bibr" rid="CR21">21</xref>
]. The most accurate classifier will therefore be useful for building a MERS-CoV prediction system.</p>
<p id="Par39">A confusion matrix is an important tool for analyzing the performance of a binary-class classification. In this matrix, each row contains information about the actual class while each column contains information about the predicted class. Accordingly, the confusion matrix shows how well a classifier can recognize tuples of different classes. Table <xref rid="Tab2" ref-type="table">2</xref>
 illustrates the confusion matrix for a two-class classifier [<xref ref-type="bibr" rid="CR20">20</xref>
].<table-wrap id="Tab2"><label>Table 2.</label>
<caption><p>Confusion matrix</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left"></th>
<th align="left">Negative (predicted)</th>
<th align="left">Positive (predicted)</th>
</tr>
</thead>
<tbody><tr><td align="left">Negative (actual)</td>
<td align="left">TN</td>
<td align="left">FP</td>
</tr>
<tr><td align="left">Positive (actual)</td>
<td align="left">FN</td>
<td align="left">TP</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par40">To evaluate the classification framework based on the confusion matrix, the accuracy formula of each classification type was used. For the binary-class classification, the accuracy was calculated based on the following formula:<disp-formula id="Equ6"><label>6</label>
<alternatives><tex-math id="M16">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ Accuracy = \frac{{100 \times ({\text{TP}} + {\text{TN}})}}{{{\text{TP}} + {\text{FN}} + {\text{TN}} + {\text{FP}}}} $$\end{document}</tex-math>
<graphic xlink:href="473257_1_En_74_Chapter_Equ6.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par41">On the other hand, for the multi-class classification, the accuracy was calculated based on the following formula:<disp-formula id="Equ7"><label>7</label>
<alternatives><tex-math id="M17">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$ Accuracy = \frac{{\mathop \sum \nolimits_{i = 1}^{l} \frac{{100 \times (TP_{i} + TN_{i} )}}{{TP_{i} + FN_{i} + TN_{i} + FP_{i} }}}}{l} $$\end{document}</tex-math>
<graphic xlink:href="473257_1_En_74_Chapter_Equ7.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
</p>
<p id="Par42">Where TP (true positives) is the number of correctly classified positive cases, TN (true negatives) is the number of correctly classified negative cases, FP (false positives) is the number of negative cases incorrectly classified as positive, FN (false negatives) is the number of positive cases incorrectly classified as negative, and l is the number of classes [<xref ref-type="bibr" rid="CR21">21</xref>
].</p>
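<p>Equations 6 and 7 can be checked with a short Python sketch. The function names, the counts, and the class labels "A", "B", and "C" are hypothetical. For the multi-class case, each class is treated in turn as the positive class and the per-class accuracies are averaged over the l classes, as in Eq. 7.</p>
<preformat><![CDATA[
```python
def binary_accuracy(tp, tn, fp, fn):
    """Eq. 6: percentage of correctly classified records."""
    return 100.0 * (tp + tn) / (tp + fn + tn + fp)


def multiclass_accuracy(actual, predicted):
    """Eq. 7: mean of the one-vs-rest accuracies over all classes."""
    classes = sorted(set(actual) | set(predicted))
    pairs = list(zip(actual, predicted))
    per_class = []
    for c in classes:
        tp = sum(a == c and p == c for a, p in pairs)
        tn = sum(a != c and p != c for a, p in pairs)
        fp = sum(a != c and p == c for a, p in pairs)
        fn = sum(a == c and p != c for a, p in pairs)
        per_class.append(binary_accuracy(tp, tn, fp, fn))
    return sum(per_class) / len(classes)


# Hypothetical counts and labels, for illustration only.
binary_result = binary_accuracy(tp=40, tn=50, fp=5, fn=5)
multi_result = multiclass_accuracy(
    actual=["A", "B", "C", "A"],
    predicted=["A", "B", "A", "A"],
)
```
]]></preformat>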
</sec>
</sec>
<sec id="Sec7"><title>Results and Discussion</title>
<p id="Par43">Based on the essential parameters of the classifier models, which are presented in the methodology section, the accuracies obtained on the MERS-CoV dataset with each classifier model for each classification type are shown in Table <xref rid="Tab3" ref-type="table">3</xref>
. The best accuracy is for the binary-class classification based on healthcare personnel class with 86.44%, which was produced by SVM and Decision Tree algorithms. Figure <xref rid="Fig4" ref-type="fig">4</xref>
 illustrates the result of the binary-class classification. When SVM was applied with the healthcare personnel class, the margin width was maximized, making the prediction faster and more accurate. On the other hand, because the healthcare personnel class became the root of the decision tree, the depth of the tree was minimized, and the resulting simple tree generated accurate predictions.<table-wrap id="Tab3"><label>Table 3.</label>
<caption><p>Classifier model accuracy for each classification type</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Classification type</th>
<th align="left">SVM classifier</th>
<th align="left">Decision tree classifier</th>
<th align="left"><italic>k</italic>
-NN classifier</th>
</tr>
</thead>
<tbody><tr><td align="left">Binary-class classification</td>
<td align="left">86.44%</td>
<td align="left">86.44%</td>
<td align="left">85.31%</td>
</tr>
<tr><td align="left">Multi-class classification</td>
<td align="left">18.24%</td>
<td align="left">42.80%</td>
<td align="left">30.80%</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="Fig4"><label>Fig. 4.</label>
<caption><p>Binary-class classification accuracy.</p>
</caption>
<graphic xlink:href="473257_1_En_74_Fig4_HTML" id="MO11"></graphic>
</fig>
</p>
<p id="Par44">Another important finding is that, for multi-class classification based on city class, the Decision Tree obtained the highest accuracy with 42.80%, whereas the accuracy of <italic>k</italic>
-NN was 30.80% and that of the SVM classifier was under 20%, as shown in Fig. <xref rid="Fig5" ref-type="fig">5</xref>
. Moreover, to evaluate the effectiveness of our method, we compared the experimental results with the results of a recent study [<xref ref-type="bibr" rid="CR7">7</xref>
], which reported a highest accuracy of 51.60% for multi-class classification with the <italic>k</italic>
-NN classifier. This difference could be due to the value of parameter <italic>k</italic>
, which is set to 5 in this study while it is set to 3 in [<xref ref-type="bibr" rid="CR7">7</xref>
].<fig id="Fig5"><label>Fig. 5.</label>
<caption><p>Multi-class classification accuracy.</p>
</caption>
<graphic xlink:href="473257_1_En_74_Fig5_HTML" id="MO12"></graphic>
</fig>
</p>
<p id="Par45">On the other hand, the researchers in [<xref ref-type="bibr" rid="CR7">7</xref>
] reported a higher accuracy of 90.00% on binary-class classification when using the Decision Tree algorithm based on the gender attribute, while our method achieved a lower accuracy of 86.44% when using the SVM and Decision Tree classifiers based on the healthcare personnel attribute. Therefore, using the healthcare personnel attribute as the binary class may not be appropriate for MERS-CoV dataset classification.</p>
<p id="Par46">Taken together with the results of the recent study, the experimental results indicate that the most accurate classifier models for the binary-class and multi-class classification types are built by using the Decision Tree and <italic>k</italic>
-NN algorithms respectively. Additionally, the results of this study indicate that the SVM classifier is not suitable for classification of the MERS-CoV dataset. In general, the main explanation of our results lies in the essential parameter settings.</p>
</sec>
<sec id="Sec8"><title>Conclusions and Future Work</title>
<p id="Par47">The classifier model performance of several classification types can greatly assist in enhancing the prediction accuracy of MERS-CoV infection. In this study, we evaluated classifier model performance for binary and multiclass classification on a real MERS-CoV dataset. Three algorithms were used to build classifier models: SVM, Decision Tree, and <italic>k</italic>
-NN. The algorithms were applied using RapidMiner, a data mining tool. The performance of classifier models was measured using the accuracy evaluation metric; in addition, cross-validation was used as a model for assessing classification performance.</p>
<p id="Par48">The experimental results have shown that both the SVM and Decision Tree classifiers achieved the highest accuracy of 86.44% on binary-class classification based on healthcare personnel class. On the other hand, the Decision Tree classifier had the highest accuracy of 42.80% among the classifiers for multiclass classification based on city class, although it did not reach a satisfactory accuracy level. In general, the comparison of the experimental results and the results of a recent study indicates that the Decision Tree and <italic>k</italic>
-NN classifiers are the most accurate classifiers for the binary-class and multi-class classification types respectively. Additionally, the SVM classifier is not suitable for classification of a MERS-CoV dataset. For future work, it is intended that this experiment will be applied to the universal MERS dataset. Furthermore, other preprocessing techniques, such as removing missing values, can be used to measure their effect on the classifier models’ performance. Additionally, other classification methods, such as ensemble learning, can be used. Also, another similarity metric, such as cosine similarity, may be used with the <italic>k</italic>
-NN algorithm. Finally, for the multiclass classification, we suggest recreating the empirical study with different parameters to determine a classifier that gives accuracy greater than 50%.</p>
</sec>
</body>
<back><ref-list id="Bib1"><title>References</title>
<ref id="CR1"><label>1.</label>
<mixed-citation publication-type="other">Coronavirus website - Ministry of Health. <ext-link ext-link-type="uri" xlink:href="http://www.moh.gov.sa/en/CCC/">http://www.moh.gov.sa/en/CCC/</ext-link>
. Accessed 29 Oct 2017</mixed-citation>
</ref>
<ref id="CR2"><label>2.</label>
<mixed-citation publication-type="other">WHO: Middle East respiratory syndrome coronavirus (MERS-CoV). <ext-link ext-link-type="uri" xlink:href="http://www.who.int/emergencies/mers-cov/en/">http://www.who.int/emergencies/mers-cov/en/</ext-link>
. Accessed 23 Oct 2017</mixed-citation>
</ref>
<ref id="CR3"><label>3.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Koh</surname>
<given-names>HC</given-names>
</name>
<name><surname>Tan</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Data mining applications in healthcare</article-title>
<source>J. Healthc. Inf. Manag.</source>
<year>2005</year>
<volume>19</volume>
<issue>2</issue>
<fpage>64</fpage>
<lpage>72</lpage>
<pub-id pub-id-type="pmid">15869215</pub-id>
</element-citation>
</ref>
<ref id="CR4"><label>4.</label>
<mixed-citation publication-type="other">Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Haryana, India, Burlington (2012)</mixed-citation>
</ref>
<ref id="CR5"><label>5.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yoo</surname>
</name>
<etal></etal>
</person-group>
<article-title>Data mining in healthcare and biomedicine: a survey of the literature</article-title>
<source>J. Med. Syst.</source>
<year>2012</year>
<volume>36</volume>
<issue>4</issue>
<fpage>2431</fpage>
<lpage>2448</lpage>
<pub-id pub-id-type="doi">10.1007/s10916-011-9710-5</pub-id>
<pub-id pub-id-type="pmid">21537851</pub-id>
</element-citation>
</ref>
<ref id="CR6"><label>6.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Al-Turaiki</surname>
<given-names>Isra</given-names>
</name>
<name><surname>Alshahrani</surname>
<given-names>Mona</given-names>
</name>
<name><surname>Almutairi</surname>
<given-names>Tahani</given-names>
</name>
</person-group>
<article-title>Building predictive models for MERS-CoV infections using data mining techniques</article-title>
<source>Journal of Infection and Public Health</source>
<year>2016</year>
<volume>9</volume>
<issue>6</issue>
<fpage>744</fpage>
<lpage>748</lpage>
<pub-id pub-id-type="doi">10.1016/j.jiph.2016.09.007</pub-id>
<pub-id pub-id-type="pmid">27641481</pub-id>
</element-citation>
</ref>
<ref id="CR7"><label>7.</label>
<mixed-citation publication-type="other">AlMansour, N., Kurdi, H.: Identifying accurate classifier models for a text-based MERS-CoV dataset. Presented at the Intelligent Systems Conference 2017, London, UK (2017)</mixed-citation>
</ref>
<ref id="CR8"><label>8.</label>
<mixed-citation publication-type="other">Deepika, K., Seema, S.: Predictive analytics to prevent and control chronic diseases, pp. 381–386 (2016)</mixed-citation>
</ref>
<ref id="CR9"><label>9.</label>
<mixed-citation publication-type="other">Ferreira, D., Oliveira, A., Freitas, A.: Applying data mining techniques to improve diagnosis in neonatal jaundice. BMC Med. Inform. Decis. Mak. <bold>12</bold>
(1), December 2012</mixed-citation>
</ref>
<ref id="CR10"><label>10.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Asri</surname>
<given-names>H</given-names>
</name>
<name><surname>Mousannif</surname>
<given-names>H</given-names>
</name>
<name><surname>Moatassime</surname>
<given-names>HA</given-names>
</name>
<name><surname>Noel</surname>
<given-names>T</given-names>
</name>
</person-group>
<article-title>Using machine learning algorithms for breast cancer risk prediction and diagnosis</article-title>
<source>Procedia Comput. Sci.</source>
<year>2016</year>
<volume>83</volume>
<fpage>1064</fpage>
<lpage>1069</lpage>
<pub-id pub-id-type="doi">10.1016/j.procs.2016.04.224</pub-id>
</element-citation>
</ref>
<ref id="CR11"><label>11.</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Li</surname>
<given-names>J</given-names>
</name>
<name><surname>Zhao</surname>
<given-names>Z</given-names>
</name>
<name><surname>Liu</surname>
<given-names>Y</given-names>
</name>
<name><surname>Cheng</surname>
<given-names>Z</given-names>
</name>
<name><surname>Wang</surname>
<given-names>X</given-names>
</name>
</person-group>
<person-group person-group-type="editor"><name><surname>Yen</surname>
<given-names>NY</given-names>
</name>
<name><surname>Hung</surname>
<given-names>JC</given-names>
</name>
</person-group>
<article-title>A comparative study on machine classification model in lung cancer cases analysis</article-title>
<source>Frontier Computing</source>
<year>2018</year>
<publisher-loc>Singapore</publisher-loc>
<publisher-name>Springer Singapore</publisher-name>
<fpage>343</fpage>
<lpage>357</lpage>
</element-citation>
</ref>
<ref id="CR12"><label>12.</label>
<mixed-citation publication-type="other">Daghistani, T., Alshammari, R.: Diagnosis of diabetes by applying data mining classification techniques. Int. J. Adv. Comput. Sci. Appl. <bold>7</bold>
(7) (2016)</mixed-citation>
</ref>
<ref id="CR13"><label>13.</label>
<mixed-citation publication-type="other">Sowjanya, K., Singhal, A., Choudhary, C.: MobDBTest: a machine learning based system for predicting diabetes risk using mobile devices, pp. 397–402 (2015)</mixed-citation>
</ref>
<ref id="CR14"><label>14.</label>
<mixed-citation publication-type="other">Kim, D., Hong, S., Choi, S., Yoon, T.: Analysis of transmission route of MERS coronavirus using decision tree and apriori algorithm, pp. 559–565 (2016)</mixed-citation>
</ref>
<ref id="CR15"><label>15.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sandhu</surname>
<given-names>R</given-names>
</name>
<name><surname>Sood</surname>
<given-names>SK</given-names>
</name>
<name><surname>Kaur</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>An intelligent system for predicting and preventing MERS-CoV infection outbreak</article-title>
<source>J. Supercomput.</source>
<year>2016</year>
<volume>72</volume>
<issue>8</issue>
<fpage>3033</fpage>
<lpage>3056</lpage>
<pub-id pub-id-type="doi">10.1007/s11227-015-1474-0</pub-id>
<pub-id pub-id-type="pmid">32214655</pub-id>
</element-citation>
</ref>
<ref id="CR16"><label>16.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jang</surname>
<given-names>Seongpil</given-names>
</name>
<name><surname>Lee</surname>
<given-names>Seunghwan</given-names>
</name>
<name><surname>Choi</surname>
<given-names>Seong-Min</given-names>
</name>
<name><surname>Seo</surname>
<given-names>Junwon</given-names>
</name>
<name><surname>Choi</surname>
<given-names>Hunseok</given-names>
</name>
<name><surname>Yoon</surname>
<given-names>Taeseon</given-names>
</name>
</person-group>
<article-title>Comparison between SARS CoV and MERS CoV Using Apriori Algorithm, Decision Tree, SVM</article-title>
<source>MATEC Web of Conferences</source>
<year>2016</year>
<volume>49</volume>
<fpage>08001</fpage>
<pub-id pub-id-type="doi">10.1051/matecconf/20164908001</pub-id>
</element-citation>
</ref>
<ref id="CR17"><label>17.</label>
<mixed-citation publication-type="other">RapidMiner Studio - RapidMiner Documentation. <ext-link ext-link-type="uri" xlink:href="http://docs.rapidminer.com/studio/">http://docs.rapidminer.com/studio/</ext-link>. Accessed 11 Jan 2017</mixed-citation>
</ref>
<ref id="CR18"><label>18.</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Witten</surname>
<given-names>IH</given-names>
</name>
<name><surname>Frank</surname>
<given-names>E</given-names>
</name>
<name><surname>Hall</surname>
<given-names>MA</given-names>
</name>
</person-group>
<source>Data Mining: Practical Machine Learning Tools and Techniques</source>
<year>2011</year>
<edition>3</edition>
<publisher-loc>Burlington</publisher-loc>
<publisher-name>Morgan Kaufmann</publisher-name>
</element-citation>
</ref>
<ref id="CR19"><label>19.</label>
<mixed-citation publication-type="other">Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 2, pp. 1137–1143 (1995)</mixed-citation>
</ref>
<ref id="CR20"><label>20.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Stehman</surname>
<given-names>SV</given-names>
</name>
</person-group>
<article-title>Selecting and interpreting measures of thematic classification accuracy</article-title>
<source>Remote Sens. Environ.</source>
<year>1997</year>
<volume>62</volume>
<issue>1</issue>
<fpage>77</fpage>
<lpage>89</lpage>
<pub-id pub-id-type="doi">10.1016/S0034-4257(97)00083-7</pub-id>
</element-citation>
</ref>
<ref id="CR21"><label>21.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sokolova</surname>
<given-names>M</given-names>
</name>
<name><surname>Lapalme</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>A systematic analysis of performance measures for classification tasks</article-title>
<source>Inf. Process. Manag.</source>
<year>2009</year>
<volume>45</volume>
<issue>4</issue>
<fpage>427</fpage>
<lpage>437</lpage>
<pub-id pub-id-type="doi">10.1016/j.ipm.2009.03.002</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000195 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000195 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:7123473
   |texte=   Selecting Accurate Classifier Models for a MERS-CoV Dataset
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:NONE" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021

	MERS exploration server
	Warning: this site is under development! It was generated by automated means from raw corpora, so the information has not been validated.
