Covid exploration server (26 March)

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Classifying and Summarizing Information from Microblogs During Epidemics

Internal identifier: 000268 (Pmc/Corpus); previous: 000267; next: 000269

Authors: Koustav Rudra; Ashish Sharma; Niloy Ganguly; Muhammad Imran

Source:

RBID : PMC:7087635

Abstract

During a new disease outbreak, frustration and uncertainty among affected and vulnerable populations increase. Affected communities look for known symptoms, prevention measures, and treatment strategies. Health organizations, on the other hand, try to get situational updates to assess the severity of the outbreak, the number of known affected cases, and other details. The recent emergence of social media platforms such as Twitter provides convenient and fast ways to disseminate information to, and consume information from, a wider audience. Research studies have shown the potential of this online information to address the information needs of concerned authorities during outbreaks, epidemics, and pandemics. In this work, we target three types of end-users: (i) the vulnerable population, people who are not yet affected and are looking for prevention-related information; (ii) the affected population, people who are affected and are looking for treatment-related information; and (iii) health organizations such as WHO, which are interested in gaining situational awareness to make timely decisions. We use Twitter data from two recent outbreaks (Ebola and MERS) to build an automatic classification approach that categorizes tweets into different disease-related categories. Moreover, the classified messages are used to generate different kinds of summaries useful for affected and vulnerable communities as well as health organizations. Results obtained from extensive experimentation show the effectiveness of the proposed approach.


Url:
DOI: 10.1007/s10796-018-9844-9
PubMed: NONE
PubMed Central: 7087635

Links to Exploration step

PMC:7087635

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Classifying and Summarizing Information from Microblogs During Epidemics</title>
<author>
<name sortKey="Rudra, Koustav" sort="Rudra, Koustav" uniqKey="Rudra K" first="Koustav" last="Rudra">Koustav Rudra</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0153 2859</institution-id>
<institution-id institution-id-type="GRID">grid.429017.9</institution-id>
<institution>IIT Kharagpur,</institution>
</institution-wrap>
Kharagpur, India</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sharma, Ashish" sort="Sharma, Ashish" uniqKey="Sharma A" first="Ashish" last="Sharma">Ashish Sharma</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0153 2859</institution-id>
<institution-id institution-id-type="GRID">grid.429017.9</institution-id>
<institution>IIT Kharagpur,</institution>
</institution-wrap>
Kharagpur, India</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ganguly, Niloy" sort="Ganguly, Niloy" uniqKey="Ganguly N" first="Niloy" last="Ganguly">Niloy Ganguly</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0153 2859</institution-id>
<institution-id institution-id-type="GRID">grid.429017.9</institution-id>
<institution>IIT Kharagpur,</institution>
</institution-wrap>
Kharagpur, India</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Imran, Muhammad" sort="Imran, Muhammad" uniqKey="Imran M" first="Muhammad" last="Imran">Muhammad Imran</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1789 3191</institution-id>
<institution-id institution-id-type="GRID">grid.452146.0</institution-id>
<institution>Qatar Computing Research Institute,</institution>
<institution>HBKU,</institution>
</institution-wrap>
Doha, Qatar</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmc">7087635</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7087635</idno>
<idno type="RBID">PMC:7087635</idno>
<idno type="doi">10.1007/s10796-018-9844-9</idno>
<idno type="pmid">NONE</idno>
<date when="2018">2018</date>
<idno type="wicri:Area/Pmc/Corpus">000268</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000268</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Classifying and Summarizing Information from Microblogs During Epidemics</title>
<author>
<name sortKey="Rudra, Koustav" sort="Rudra, Koustav" uniqKey="Rudra K" first="Koustav" last="Rudra">Koustav Rudra</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0153 2859</institution-id>
<institution-id institution-id-type="GRID">grid.429017.9</institution-id>
<institution>IIT Kharagpur,</institution>
</institution-wrap>
Kharagpur, India</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Sharma, Ashish" sort="Sharma, Ashish" uniqKey="Sharma A" first="Ashish" last="Sharma">Ashish Sharma</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0153 2859</institution-id>
<institution-id institution-id-type="GRID">grid.429017.9</institution-id>
<institution>IIT Kharagpur,</institution>
</institution-wrap>
Kharagpur, India</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Ganguly, Niloy" sort="Ganguly, Niloy" uniqKey="Ganguly N" first="Niloy" last="Ganguly">Niloy Ganguly</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0153 2859</institution-id>
<institution-id institution-id-type="GRID">grid.429017.9</institution-id>
<institution>IIT Kharagpur,</institution>
</institution-wrap>
Kharagpur, India</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Imran, Muhammad" sort="Imran, Muhammad" uniqKey="Imran M" first="Muhammad" last="Imran">Muhammad Imran</name>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1789 3191</institution-id>
<institution-id institution-id-type="GRID">grid.452146.0</institution-id>
<institution>Qatar Computing Research Institute,</institution>
<institution>HBKU,</institution>
</institution-wrap>
Doha, Qatar</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Information Systems Frontiers</title>
<idno type="ISSN">1387-3326</idno>
<idno type="eISSN">1572-9419</idno>
<imprint>
<date when="2018">2018</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p id="Par5">During a new disease outbreak, frustration and uncertainty among affected and vulnerable populations increase. Affected communities look for known symptoms, prevention measures, and treatment strategies. Health organizations, on the other hand, try to get situational updates to assess the severity of the outbreak, the number of known affected cases, and other details. The recent emergence of social media platforms such as Twitter provides convenient and fast ways to disseminate information to, and consume information from, a wider audience. Research studies have shown the potential of this online information to address the information needs of concerned authorities during outbreaks, epidemics, and pandemics. In this work, we target three types of end-users: (i) the vulnerable population, people who are not yet affected and are looking for prevention-related information; (ii) the affected population, people who are affected and are looking for treatment-related information; and (iii) health organizations such as WHO, which are interested in gaining situational awareness to make timely decisions. We use Twitter data from two recent outbreaks (Ebola and MERS) to build an automatic classification approach that categorizes tweets into different disease-related categories. Moreover, the classified messages are used to generate different kinds of summaries useful for affected and vulnerable communities as well as health organizations. Results obtained from extensive experimentation show the effectiveness of the proposed approach.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bodenreider, O" uniqKey="Bodenreider O">O Bodenreider</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Denecke, K" uniqKey="Denecke K">K Denecke</name>
</author>
<author>
<name sortKey="Nejdl, W" uniqKey="Nejdl W">W Nejdl</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fox, S" uniqKey="Fox S">S Fox</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Friedman, C" uniqKey="Friedman C">C Friedman</name>
</author>
<author>
<name sortKey="Hripcsak, G" uniqKey="Hripcsak G">G Hripcsak</name>
</author>
<author>
<name sortKey="Shagina, L" uniqKey="Shagina L">L Shagina</name>
</author>
<author>
<name sortKey="Liu, H" uniqKey="Liu H">H Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Friedman, C" uniqKey="Friedman C">C Friedman</name>
</author>
<author>
<name sortKey="Shagina, L" uniqKey="Shagina L">L Shagina</name>
</author>
<author>
<name sortKey="Lussier, Y" uniqKey="Lussier Y">Y Lussier</name>
</author>
<author>
<name sortKey="Hripcsak, G" uniqKey="Hripcsak G">G Hripcsak</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hripcsak, G" uniqKey="Hripcsak G">G Hripcsak</name>
</author>
<author>
<name sortKey="Austin, Jh" uniqKey="Austin J">JH Austin</name>
</author>
<author>
<name sortKey="Alderson, Po" uniqKey="Alderson P">PO Alderson</name>
</author>
<author>
<name sortKey="Friedman, C" uniqKey="Friedman C">C Friedman</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kinnane, Na" uniqKey="Kinnane N">NA Kinnane</name>
</author>
<author>
<name sortKey="Milne, Dj" uniqKey="Milne D">DJ Milne</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Park, A" uniqKey="Park A">A Park</name>
</author>
<author>
<name sortKey="Hartzler, Al" uniqKey="Hartzler A">AL Hartzler</name>
</author>
<author>
<name sortKey="Huh, J" uniqKey="Huh J">J Huh</name>
</author>
<author>
<name sortKey="Mcdonald, Dw" uniqKey="Mcdonald D">DW McDonald</name>
</author>
<author>
<name sortKey="Pratt, W" uniqKey="Pratt W">W Pratt</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Paul, Mj" uniqKey="Paul M">MJ Paul</name>
</author>
<author>
<name sortKey="Dredze, M" uniqKey="Dredze M">M Dredze</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pedregosa, F" uniqKey="Pedregosa F">F Pedregosa</name>
</author>
<author>
<name sortKey="Varoquaux, G" uniqKey="Varoquaux G">G Varoquaux</name>
</author>
<author>
<name sortKey="Gramfort, A" uniqKey="Gramfort A">A Gramfort</name>
</author>
<author>
<name sortKey="Michel, V" uniqKey="Michel V">V Michel</name>
</author>
<author>
<name sortKey="Thirion, B" uniqKey="Thirion B">B Thirion</name>
</author>
<author>
<name sortKey="Grisel, O" uniqKey="Grisel O">O Grisel</name>
</author>
<author>
<name sortKey="Blondel, M" uniqKey="Blondel M">M Blondel</name>
</author>
<author>
<name sortKey="Prettenhofer, P" uniqKey="Prettenhofer P">P Prettenhofer</name>
</author>
<author>
<name sortKey="Weiss, R" uniqKey="Weiss R">R Weiss</name>
</author>
<author>
<name sortKey="Dubourg, V" uniqKey="Dubourg V">V Dubourg</name>
</author>
<author>
<name sortKey="Vanderplas, J" uniqKey="Vanderplas J">J Vanderplas</name>
</author>
<author>
<name sortKey="Passos, A" uniqKey="Passos A">A Passos</name>
</author>
<author>
<name sortKey="Cournapeau, D" uniqKey="Cournapeau D">D Cournapeau</name>
</author>
<author>
<name sortKey="Brucher, M" uniqKey="Brucher M">M Brucher</name>
</author>
<author>
<name sortKey="Perrot, M" uniqKey="Perrot M">M Perrot</name>
</author>
<author>
<name sortKey="Duchesnay, E" uniqKey="Duchesnay E">E Duchesnay</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roberts, K" uniqKey="Roberts K">K Roberts</name>
</author>
<author>
<name sortKey="Harabagiu, Sm" uniqKey="Harabagiu S">SM Harabagiu</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Savova, Gk" uniqKey="Savova G">GK Savova</name>
</author>
<author>
<name sortKey="Masanz, Jj" uniqKey="Masanz J">JJ Masanz</name>
</author>
<author>
<name sortKey="Ogren, Pv" uniqKey="Ogren P">PV Ogren</name>
</author>
<author>
<name sortKey="Zheng, J" uniqKey="Zheng J">J Zheng</name>
</author>
<author>
<name sortKey="Sohn, S" uniqKey="Sohn S">S Sohn</name>
</author>
<author>
<name sortKey="Kipper Schuler, Kc" uniqKey="Kipper Schuler K">KC Kipper-Schuler</name>
</author>
<author>
<name sortKey="Chute, Cg" uniqKey="Chute C">CG Chute</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Scanfeld, D" uniqKey="Scanfeld D">D Scanfeld</name>
</author>
<author>
<name sortKey="Scanfeld, V" uniqKey="Scanfeld V">V Scanfeld</name>
</author>
<author>
<name sortKey="Larson, El" uniqKey="Larson E">EL Larson</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Uzuner, O" uniqKey="Uzuner O">Ö Uzuner</name>
</author>
<author>
<name sortKey="South, Br" uniqKey="South B">BR South</name>
</author>
<author>
<name sortKey="Shen, S" uniqKey="Shen S">S Shen</name>
</author>
<author>
<name sortKey="Duvall, Sl" uniqKey="Duvall S">SL DuVall</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yang, Fc" uniqKey="Yang F">FC Yang</name>
</author>
<author>
<name sortKey="Lee, Aj" uniqKey="Lee A">AJ Lee</name>
</author>
<author>
<name sortKey="Kuo, Sc" uniqKey="Kuo S">SC Kuo</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Inf Syst Front</journal-id>
<journal-id journal-id-type="iso-abbrev">Inf Syst Front</journal-id>
<journal-title-group>
<journal-title>Information Systems Frontiers</journal-title>
</journal-title-group>
<issn pub-type="ppub">1387-3326</issn>
<issn pub-type="epub">1572-9419</issn>
<publisher>
<publisher-name>Springer US</publisher-name>
<publisher-loc>New York</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmc">7087635</article-id>
<article-id pub-id-type="publisher-id">9844</article-id>
<article-id pub-id-type="doi">10.1007/s10796-018-9844-9</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Classifying and Summarizing Information from Microblogs During Epidemics</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0002-2486-7608</contrib-id>
<name>
<surname>Rudra</surname>
<given-names>Koustav</given-names>
</name>
<address>
<email>koustav.rudra@cse.iitkgp.ernet.in</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<bio>
<sec id="d29e182">
<title>Koustav Rudra</title>
<p id="Par1">received the B.E. degree in computer science from the Indian Institute of Engineering Science and Technology, Shibpur, Howrah, India, and the M.Tech degree from IIT Kharagpur, India. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, IIT Kharagpur, Kharagpur, India. His current research interests include social networks, information retrieval, and data mining.</p>
</sec>
</bio>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Sharma</surname>
<given-names>Ashish</given-names>
</name>
<address>
<email>ashishshrma22@gmail.com</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<bio>
<sec id="d29e197">
<title>Ashish Sharma</title>
<p id="Par2">is currently pursuing the Dual degree with the Department of Computer Science and Engineering, IIT Kharagpur, Kharagpur, India. His current research interests include social networks and information retrieval.</p>
</sec>
</bio>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ganguly</surname>
<given-names>Niloy</given-names>
</name>
<address>
<email>niloy@cse.iitkgp.ernet.in</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<bio>
<sec id="d29e212">
<title>Niloy Ganguly</title>
<p id="Par3">received the B.Tech. degree from IIT Kharagpur, Kharagpur, India, and the Ph.D. degree from the Indian Institute of Engineering Science and Technology, Shibpur, Howrah, India. He was a Post-Doctoral Fellow with Technical University, Dresden, Germany. He is currently a Professor with the Department of Computer Science and Engineering, IIT Kharagpur, where he leads the Complex Networks Research Group. His current research interests include complex networks, social networks, and mobile systems.</p>
</sec>
</bio>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Imran</surname>
<given-names>Muhammad</given-names>
</name>
<address>
<email>mimran@hbku.edu.qa</email>
</address>
<xref ref-type="aff" rid="Aff2">2</xref>
<bio>
<sec id="d29e227">
<title>Muhammad Imran</title>
<p id="Par4">is a Scientist at the Qatar Computing Research Institute (QCRI) where he leads the Crisis Computing team. His interdisciplinary research focuses on natural language processing, text mining, human-computer interaction, applied machine learning, and stream processing areas. Dr. Imran has published over 50 research papers in top-tier international conferences and journals. Two of his papers have received the Best Paper Award. He has been serving as a Co-Chair of the Social Media Studies track of the ISCRAM international conference since 2014 and has served as Program Committee (PC) for many major conferences and workshops including SIGIR, ICWSM, ACM DH, ICWE, SWDM. Dr. Imran has worked as a Post-Doctoral researcher at QCRI (2013-2015). He received his Ph.D. in Computer Science from the University of Trento, Italy (2013), where he also used to co-teach various computer science courses (2009-2012).</p>
</sec>
</bio>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0001 0153 2859</institution-id>
<institution-id institution-id-type="GRID">grid.429017.9</institution-id>
<institution>IIT Kharagpur,</institution>
</institution-wrap>
Kharagpur, India</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 1789 3191</institution-id>
<institution-id institution-id-type="GRID">grid.452146.0</institution-id>
<institution>Qatar Computing Research Institute,</institution>
<institution>HBKU,</institution>
</institution-wrap>
Doha, Qatar</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>20</day>
<month>3</month>
<year>2018</year>
</pub-date>
<pub-date pub-type="ppub">
<year>2018</year>
</pub-date>
<volume>20</volume>
<issue>5</issue>
<fpage>933</fpage>
<lpage>948</lpage>
<permissions>
<copyright-statement>© Springer Science+Business Media, LLC, part of Springer Nature 2018</copyright-statement>
<license>
<license-p>This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<p id="Par5">During a new disease outbreak, frustration and uncertainty among affected and vulnerable populations increase. Affected communities look for known symptoms, prevention measures, and treatment strategies. Health organizations, on the other hand, try to get situational updates to assess the severity of the outbreak, the number of known affected cases, and other details. The recent emergence of social media platforms such as Twitter provides convenient and fast ways to disseminate information to, and consume information from, a wider audience. Research studies have shown the potential of this online information to address the information needs of concerned authorities during outbreaks, epidemics, and pandemics. In this work, we target three types of end-users: (i) the vulnerable population, people who are not yet affected and are looking for prevention-related information; (ii) the affected population, people who are affected and are looking for treatment-related information; and (iii) health organizations such as WHO, which are interested in gaining situational awareness to make timely decisions. We use Twitter data from two recent outbreaks (Ebola and MERS) to build an automatic classification approach that categorizes tweets into different disease-related categories. Moreover, the classified messages are used to generate different kinds of summaries useful for affected and vulnerable communities as well as health organizations. Results obtained from extensive experimentation show the effectiveness of the proposed approach.</p>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Health crisis</kwd>
<kwd>Epidemic</kwd>
<kwd>Twitter</kwd>
<kwd>Classification</kwd>
<kwd>Summarization</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© Springer Science+Business Media, LLC, part of Springer Nature 2018</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1">
<title>Introduction</title>
<p id="Par6">During disease outbreaks, information posted by affected people on microblogging platforms such as Twitter provides rapid access to diverse and useful insights that help in understanding various facets of the outbreak. Research studies conducted with formal health organizations have shown the potential of such health-related information on Twitter for a quick response (De Choudhury
<xref ref-type="bibr" rid="CR5">2015</xref>
). However, during an ongoing epidemic, health organizations can use this online information effectively for response efforts or decision-making only if it is processed and analyzed as quickly as possible. During an epidemic, social media users post millions of messages containing information about disease signs and symptoms, prevention strategies, transmission mediums, death reports, and personal opinions and experiences.</p>
<p id="Par7">To enable health experts to understand and use this online information for decision-making, messages must be categorized into different informative categories (e.g., symptom reports, prevention reports, treatment reports), and irrelevant content should be discarded. Although the categorization step helps organize related messages into categories, each category may still contain thousands of messages, which would again be difficult for health experts, as well as affected or vulnerable people, to process manually. While the key information contained in these tweets is useful for health experts in their decision-making process, we also observe that different disease categories exhibit different traits (e.g., specific symptom characteristics), which can be exploited to extract and summarize relevant information.</p>
<p id="Par8">Moreover, we observe that different stakeholders (e.g., health organizations and affected or vulnerable user groups) have different information needs. In this paper, we target the following three user groups: (i)
<bold>Vulnerable population:</bold>
people who are primarily looking for preventive measures and for signs or symptoms of a disease so that they can take precautions. These people are not yet affected, but they are vulnerable. (ii)
<bold>Affected population:</bold>
people who are already affected by the disease and are trying to recover. (iii)
<bold>Health organizations:</bold>
primarily government and health organizations that look for general situational updates such as ‘how many people have died or are under treatment’ and ‘whether any new service is required’.</p>
<sec id="FPar1">
<title>
<bold>Assisting Vulnerable Population</bold>
</title>
<p id="Par9">Vulnerable groups look for precautionary measures that can guard them against acquiring a disease. To assist these groups, our proposed system tries to extract precaution-related information, such as signs and symptoms (e.g., ‘fever’, ‘flu’, ‘vomiting’, ‘diarrhea’) and disease transmission mediums (e.g., ‘Transmission of Ebola Virus By Air Possible’), from the related informational categories of tweets. Automatic extraction of such precautionary measures or symptoms is challenging for a number of reasons. For instance, we observe that such information circulates in two forms: (i) positive (confirmations), e.g., someone confirms that “flu” is a symptom of the Ebola virus, or a tweet reports that people should follow measures x, y, and z to avoid getting affected; and (ii) negative (contradictions), e.g., a tweet reports that “fever” is not a symptom of the Ebola virus. In the latter case, our system should clearly specify that “fever” is not a symptom of Ebola. To tackle this effectively, our system extracts the contextual information (positive or negative) of terms related to precautionary measures, such as symptoms and preventive suggestions, and assists people accordingly during epidemics.</p>
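<p>The positive/negative distinction described above can be illustrated with a minimal sketch. It is not the system's actual method: the cue list NEGATION_CUES, the lexicon SYMPTOMS, the function name symptom_polarity, and the fixed token window are all assumptions made for illustration.</p>
<preformat>
```python
import re

# Hypothetical negation cues and symptom lexicon, for illustration only.
NEGATION_CUES = {"not", "no", "never", "isn't", "without"}
SYMPTOMS = {"fever", "flu", "vomiting", "diarrhea"}


def symptom_polarity(tweet, window=2):
    """Map each symptom term found in the tweet to 'positive' or 'negative'.

    A term is marked negative when a negation cue occurs within `window`
    tokens on either side of it. This is a crude stand-in for real
    context analysis and will misjudge longer or nested sentences.
    """
    tokens = re.findall(r"[a-z']+", tweet.lower())
    result = {}
    for i, tok in enumerate(tokens):
        if tok in SYMPTOMS:
            context = tokens[max(0, i - window):i + window + 1]
            negated = any(c in NEGATION_CUES for c in context)
            result[tok] = "negative" if negated else "positive"
    return result
```
</preformat>
<p>Under these assumptions, a tweet such as "Fever is not a symptom of Ebola" marks "fever" as negative, while "Flu is a symptom of Ebola" marks "flu" as positive.</p>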
</sec>
<sec id="FPar2">
<title>
<bold>Assisting Affected Population</bold>
</title>
<p id="Par10">In this case, the target community is considered to be already affected by the epidemic (e.g., users have already fallen sick). Users in this community look for treatment-related information or for nearby hospitals that deal with the ongoing epidemic. To assist these users, we extract recovery and treatment information from tweets. In the case of contagious diseases, it is also necessary to alert the affected user groups so that further transmission of the disease can be stopped.</p>
</sec>
<sec id="FPar3">
<title>
<bold>Assisting Health Organizations</bold>
</title>
<p id="Par11">During epidemics, government and other health monitoring agencies (WHO, CDC) look for information about victims, affected people, death reports, vulnerable people, etc., so that they can judge the severity of the situation and take necessary actions accordingly, such as seeking help from experts or doctors in other countries or setting up separate treatment centers. Travelers from foreign countries are also often affected by sudden outbreaks. In such cases, the local government has to inform the travelers' home countries about the current status, and sometimes it has to arrange special flights to send the affected people home. Considering all these perspectives, the proposed approach tries to extract relevant information about affected or vulnerable people.</p>
<p id="Par12">To the best of our knowledge, all previous research on health and social media (De Choudhury
<xref ref-type="bibr" rid="CR5">2015</xref>
; de Quincey et al.
<xref ref-type="bibr" rid="CR6">2016</xref>
; Yom-Tov
<xref ref-type="bibr" rid="CR40">2015</xref>
) focuses on analyzing behavioral and social aspects of users who post information about a particular disease, and on predicting whether a user is going to encounter such a disease in the future based on her current posts. However, a generic model that can assist different stakeholders during an epidemic is also important. We make the following contributions in this work:
<list list-type="bullet">
<title>Contributions</title>
<list-item>
<p id="Par13">We develop a classifier that uses low-level lexical features to distinguish between different disease categories. Vocabulary-independent features allow our classifier to function accurately in cross-domain scenarios, e.g., when a classifier trained on tweets posted during a past outbreak is used to classify tweets posted during a current or future outbreak.</p>
</list-item>
<list-item>
<p id="Par14">For each of the identified information classes, we propose different information extraction-summarization techniques, which optimize the coverage of specific disease-related terms using an Integer Linear Programming (ILP) approach. Information extracted in this phase helps fulfill the information needs of different affected or vulnerable end-users.</p>
</list-item>
</list>
</p>
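<p>To make the idea of vocabulary-independent features concrete, the following minimal sketch computes a few surface cues that do not depend on the name of any particular disease. The feature set shown here (numerals, question marks, negation words, modal verbs, token count) and the function name lexical_features are hypothetical; this section does not list the features the classifier actually uses.</p>
<preformat>
```python
import re

# Hypothetical cue lists; the paper's actual feature set is not reproduced here.
NEGATION = {"not", "no", "never"}
MODALS = {"should", "must", "can", "may"}


def lexical_features(tweet):
    """Return surface features that ignore outbreak-specific vocabulary."""
    tokens = re.findall(r"[A-Za-z']+|\d+", tweet)
    lower = [t.lower() for t in tokens]
    return {
        "has_numeral": any(t.isdigit() for t in tokens),
        "has_question": "?" in tweet,
        "has_negation": any(t in NEGATION for t in lower),
        "has_modal": any(t in MODALS for t in lower),
        "n_tokens": len(tokens),
    }
```
</preformat>
<p>With such features, "5 new Ebola deaths reported" and "5 new MERS deaths reported" map to the same feature values, which is the property that lets a model trained on one outbreak be applied to another.</p>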
<p id="Par15">Note that our epidemic tweet classification approach was first proposed in a prior study (Rudra et al.
<xref ref-type="bibr" rid="CR32">2017</xref>
). The present work extends that study as follows: after classifying tweets into different informative categories (symptom, prevention, etc.), we propose novel ILP-based information extraction-summarization methods for each class, which extract disease-related terms and maximize their coverage in the final summary.</p>
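<p>Coverage-maximizing summarization can be read as a small 0/1 optimization problem: choose a subset of tweets that covers as many distinct disease-related terms as possible within a length budget. The sketch below illustrates that reading only; the exact ILP formulation is not given in this section, and exhaustive search is used here in place of an ILP solver. The function name best_summary and the parameter word_budget are hypothetical.</p>
<preformat>
```python
from itertools import combinations


def best_summary(tweets, terms, word_budget):
    """Pick the subset of tweets covering the most distinct terms
    while keeping the total word count within word_budget.

    Exhaustive search stands in for an ILP solver, so this is only
    practical for a handful of tweets. Words are split on whitespace
    and punctuation is not stripped.
    """
    def covered(subset):
        words = set()
        for t in subset:
            words.update(t.lower().split())
        return words.intersection(terms)

    best, best_score = (), -1
    for k in range(len(tweets) + 1):
        for subset in combinations(tweets, k):
            length = sum(len(t.split()) for t in subset)
            if length > word_budget:
                continue  # violates the length constraint
            score = len(covered(subset))
            if score > best_score:
                best, best_score = subset, score
    return list(best), best_score
```
</preformat>
<p>For realistic input sizes, the same objective and constraint would be handed to an ILP solver rather than enumerated.</p>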
<p id="Par16">Figure 
<xref rid="Fig1" ref-type="fig">1</xref>
provides an overview of our approach. Experiments conducted on real Twitter datasets from two recent disease outbreaks (World Health Organization (WHO)
<xref ref-type="bibr" rid="CR38">2014</xref>
; Centers for Disease Control and Prevention
<xref ref-type="bibr" rid="CR4">2014</xref>
) show that the proposed low-level lexical classifier outperforms a vocabulary-based approach (Imran et al.
<xref ref-type="bibr" rid="CR20">2014</xref>
) in a cross-domain scenario (Section 
<xref rid="Sec8" ref-type="sec">4</xref>
). Next, we show the utility of disease-specific key terms in capturing information from different disease-related classes and in summarizing that information (Section 
<xref rid="Sec13" ref-type="sec">5</xref>
). We evaluate the performance of our proposed summarization scheme in Section 
<xref rid="Sec19" ref-type="sec">6</xref>
. Our proposed ILP-based summarization framework (MEDSUM) outperforms the real-time disaster summarization approach (COWTS) proposed by Rudra et al. (
<xref ref-type="bibr" rid="CR31">2015</xref>
). Section 
<xref rid="Sec25" ref-type="sec">7</xref>
shows how the extracted information satisfies the needs of various stakeholders. Finally, we conclude our paper in Section 
<xref rid="Sec26" ref-type="sec">8</xref>
.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Our proposed framework for classification-summarization of tweets posted during an epidemic</p>
</caption>
<graphic xlink:href="10796_2018_9844_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
</sec>
</sec>
<sec id="Sec2">
<title>Related Work</title>
<p id="Par17">Twitter, Facebook, online health forums and message boards are increasingly being used by professionals and patients to obtain health information and share their health experiences (Kinnane and Milne
<xref ref-type="bibr" rid="CR22">2010</xref>
). Fox (
<xref ref-type="bibr" rid="CR11">2011</xref>
) reported that 80% of internet users use online resources for information about health topics like specific disease or treatment. Further, it was shown that 34% of health searchers use social media resources to find health related topics (Elkin
<xref ref-type="bibr" rid="CR9">2008</xref>
). The popularity of social media in the medical and health domain has drawn researchers to study various healthcare topics. This section gives a brief overview of research on using medical social media data to extract meaningful information and shows how these approaches differ from traditional systems built for clinical notes.</p>
<sec id="Sec3">
<title>Mining Information from Clinical Notes</title>
<p id="Par18">Various methods have been proposed for mining health and medical information from clinical notes. Most of these works have focused on extracting a broad class of medical conditions (e.g., diseases, injuries, and medical symptoms) and responses (e.g., diagnoses, procedures, and drugs), with the goal of developing applications that improve patient care (Friedman et al.
<xref ref-type="bibr" rid="CR12">1999</xref>
,
<xref ref-type="bibr" rid="CR13">2004</xref>
; Heinze et al.
<xref ref-type="bibr" rid="CR17">2001</xref>
; Hripcsak et al.
<xref ref-type="bibr" rid="CR19">2002</xref>
). The 2010 i2b2/VA challenge (Uzuner et al.
<xref ref-type="bibr" rid="CR37">2011</xref>
) presented the task of extracting medical concepts, tests, and treatments from a given dataset. Most of the techniques use Conditional Random Field (CRF) or rule-based classifiers. Roberts and Harabagiu (
<xref ref-type="bibr" rid="CR30">2011</xref>
) built a flexible framework for identifying medical concepts in clinical text, and classifying assertions, which indicate the existence, absence, or uncertainty of a medical problem. The framework was evaluated on the 2010 i2b2/VA challenge data. Recently Goodwin and Harabagiu (
<xref ref-type="bibr" rid="CR15">2016</xref>
) utilized this framework for building a clinical question-answering system. They used a probabilistic knowledge graph, generated from electronic medical records (EMRs), in order to carry out answer inference.</p>
</sec>
<sec id="Sec4">
<title>Mining Health Information from Social Media Data</title>
<p id="Par19">Scanfeld et al. (
<xref ref-type="bibr" rid="CR34">2010</xref>
) used Q-Methodology to determine the main categories of content contained in Twitter users’ status updates mentioning antibiotics. Lu et al. (
<xref ref-type="bibr" rid="CR25">2013</xref>
) built a framework based on clustering analysis to explore interesting health-related topics in online health communities. It used sentiment-based features extracted from SentiWordNet (Esuli and Sebastiani
<xref ref-type="bibr" rid="CR10">2007</xref>
) and domain-specific features from MetaMap. Denecke and Nejdl (
<xref ref-type="bibr" rid="CR8">2009</xref>
) performed a comprehensive content analysis of different health-related Web resources. They also classified medical weblogs according to their information type using features extracted from MetaMap. A framework based on Latent Dirichlet Allocation (LDA) to analyze discussion threads in a health community was proposed by Yang et al. (
<xref ref-type="bibr" rid="CR39">2016</xref>
). They first extracted medical concepts, then used a modified LDA to cluster documents, and finally performed sentiment analysis for each conditional topic. Recently, large-scale studies have explored how microblogs can be used to extract symptoms related to diseases (Paul and Dredze
<xref ref-type="bibr" rid="CR28">2011</xref>
), mental health (Homan et al.
<xref ref-type="bibr" rid="CR18">2014</xref>
) and so on.</p>
<p id="Par20">Most of the methods proposed for extracting information from clinical text utilize earlier proposed systems (e.g., MetaMap (Aronson
<xref ref-type="bibr" rid="CR1">2001</xref>
), cTakes (Savova et al.
<xref ref-type="bibr" rid="CR33">2010</xref>
)) for mapping clinical documents to concepts of medical terminologies and ontologies (e.g., UMLS (Bodenreider
<xref ref-type="bibr" rid="CR3">2004</xref>
), SNOMED CT (Stearns et al.
<xref ref-type="bibr" rid="CR35">2001</xref>
)). For a given text, these systems return the extracted terms as concepts of clinical terminologies, which can be used to describe the content of a document in a standardized way. However, tools like MetaMap were designed specifically to process clinical documents and are thus specialized to their linguistic characteristics (Denecke
<xref ref-type="bibr" rid="CR7">2014</xref>
). User-generated medical text from social media differs significantly from professionally written clinical notes. Recent studies have shown that directly applying MetaMap to social media data leads to low-quality word labels (Tu et al.
<xref ref-type="bibr" rid="CR36">2016</xref>
). There have also been works that identify the kinds of failures MetaMap experiences when applied to social media data. Recently, Park et al. (
<xref ref-type="bibr" rid="CR27">2014</xref>
) characterized failures of MetaMap into boundary failures, missed term failures and word sense ambiguity failures.</p>
<p id="Par21">Researchers have also put considerable effort into designing text classification techniques (Imran et al.
<xref ref-type="bibr" rid="CR20">2014</xref>
; Rudra et al.
<xref ref-type="bibr" rid="CR31">2015</xref>
) suitable for microblogs. In our recent work, we proposed a classifier based on low-level lexical features to classify tweets posted during epidemics (Rudra et al.
<xref ref-type="bibr" rid="CR32">2017</xref>
).</p>
<p id="Par22">To our knowledge, existing methods extract knowledge from past medical records to infer solutions, diagnoses, or treatments. However, these techniques do not work for sudden outbreaks for which past medical records are not available. There is no existing framework that extracts, classifies, and summarizes information from microblogs in real time. In this work, we take a first step towards this problem and propose a real-time classification-summarization framework that can be applied to future epidemics.</p>
</sec>
</sec>
<sec id="Sec5">
<title>Dataset and Classification of Messages</title>
<p id="Par23">This section describes the datasets of tweets that are used to evaluate our classification—summarization approach.</p>
<sec id="Sec6">
<title>Epidemics</title>
<p id="Par24">We collect crisis-related messages posted on Twitter during two recent epidemics using the AIDR platform (Imran et al.
<xref ref-type="bibr" rid="CR20">2014</xref>
):
<list list-type="order">
<list-item>
<p id="Par25">
<bold>Ebola:</bold>
This dataset consists of 5.08 million messages posted between August 6th, 2014 and January 19th, 2015 obtained using different keywords (e.g., #Ebola, ⋯).</p>
</list-item>
<list-item>
<p id="Par26">
<bold>MERS:</bold>
This dataset was collected during the Middle East Respiratory Syndrome (MERS) outbreak and consists of 0.215 million messages posted between April 27th and July 16th, 2014, obtained using different keywords (e.g., #MERS, ⋯).</p>
</list-item>
</list>
</p>
<p id="Par27">First, we remove non-English tweets using the language information provided by Twitter. After this step, we obtain around 200K tweets for MERS, collected over a period of two and a half months. Tweets for Ebola, however, were collected over a period of six months, and we observe that most of the tweets (around 80%) posted after the first two months are exact or near duplicates of tweets posted during the first two months. Hence, for consistency, we select the first 200,000 tweets in chronological order for both datasets. We make the tweet-ids publicly available to the research community at
<ext-link ext-link-type="uri" xlink:href="http://cse.iitkgp.ac.in/~krudra/epidemic.html">http://cse.iitkgp.ac.in/~krudra/epidemic.html</ext-link>
.</p>
</sec>
<sec id="Sec7">
<title>Types of Tweets Posted During Epidemics</title>
<p id="Par28">As stated earlier, tweets posted during an epidemic event include disease-related tweets as well as non-disease tweets. We employ human volunteers to observe different categories of disease tweets and to annotate them (details in Section 
<xref rid="Sec8" ref-type="sec">4</xref>
). The disease categories identified by our volunteers (which agree with prior work (Goodwin and Harabagiu
<xref ref-type="bibr" rid="CR15">2016</xref>
; Imran et al.
<xref ref-type="bibr" rid="CR21">2016</xref>
)) are as follows. Some example tweets of each category are shown in Table 
<xref rid="Tab1" ref-type="table">1</xref>
.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Examples of various types of disease tweets (which contribute to information about epidemic) and non-disease tweets</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Type</th>
<th align="left">Event</th>
<th align="left">Tweet text</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" colspan="3">Disease tweets (which contribute to information about epidemic)</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Ebola</td>
<td align="left">Early #ebola symptoms include fever headache body aches cough stomach pain</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left">vomiting and diarrhea</td>
</tr>
<tr>
<td align="left">Symptom</td>
<td align="left">MERS</td>
<td align="left">Middle east respiratory syndrome symptoms include cough fever can lead to</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left">pneumonia &amp; kidney failure</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Ebola</td>
<td align="left">Ebola is a deadly disease prevent it today drink / bath with salty warm water</td>
</tr>
<tr>
<td align="left">Prevention</td>
<td align="left">MERS</td>
<td align="left">#mers prevention tip 3/5—avoid touching your eyes nose and mouth with</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left">unwashed hands</td>
</tr>
<tr>
<td align="left">Disease</td>
<td align="left">Ebola</td>
<td align="left">Airborne cdc now confirms concerns of airborne transmission of ebola</td>
</tr>
<tr>
<td align="left">transmission</td>
<td align="left">MERS</td>
<td align="left">World health a camel reasons corona virus transmission</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Ebola</td>
<td align="left">Dozens flock to new liberia ebola treatment center new liberia ebola treatment</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left">center receives more than 100</td>
</tr>
<tr>
<td align="left">Treatment</td>
<td align="left">MERS</td>
<td align="left">cn-old drugs tested to fight new disease mers</td>
</tr>
<tr>
<td align="left">Death</td>
<td align="left">Ebola</td>
<td align="left">The largest #ebola outbreak on record has killed 4000 +</td>
</tr>
<tr>
<td align="left">report</td>
<td align="left">MERS</td>
<td align="left">Saudia Arabia reports 102 deaths from mers disease</td>
</tr>
<tr>
<td align="left" colspan="3">Non-disease tweets</td>
</tr>
<tr>
<td align="left">Not</td>
<td align="left">Ebola</td>
<td align="left">lies then he came to attack nigeria with ebola disease what is govt doing about</td>
</tr>
<tr>
<td align="left">relevant</td>
<td align="left"></td>
<td align="left">that too</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">MERS</td>
<td align="left">good question unfortunately i have not the answer but something to investigate</td>
</tr>
<tr>
<td align="left"></td>
<td align="left"></td>
<td align="left">fomites #mers</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<sec id="FPar4">
<title>
<bold>Disease-Related Tweets</bold>
</title>
<p id="Par29">Tweets which contain disease related information are primarily of the following five types: (i)
<italic>Symptom</italic>
– reports of symptoms such as fever, cough, diarrhea, and shortness of breath or questions related to these symptoms. (ii)
<italic>Prevention</italic>
– questions or suggestions related to the prevention of disease or mention of a new prevention strategy. (iii)
<italic>Disease transmission</italic>
– reports of disease transmission or questions related to disease transmission. (iv)
<italic>Treatment</italic>
– questions or suggestions regarding the treatments of the disease. (v)
<italic>Death report</italic>
– reports of people who died of or were affected by the disease.</p>
</sec>
<sec id="FPar5">
<title>
<bold>Non-disease Tweets</bold>
</title>
<p id="Par30">Non-disease tweets do not contribute to disease awareness and mostly contain sentiment/opinion of people.</p>
<p id="Par31">In this work, we try to extract information for both primary and secondary health care services. Symptom, prevention, and transmission classes are relevant to primary health care (vulnerable population) and information about treatment is necessary for secondary health care service (affected population). Finally, reports of dead and affected people are important for government and monitoring agencies.</p>
<p id="Par32">The next two sections discuss our proposed methodology, which first categorizes disease-related information (Section 
<xref rid="Sec8" ref-type="sec">4</xref>
), and then summarizes the information in each category (Section 
<xref rid="Sec13" ref-type="sec">5</xref>
).</p>
</sec>
</sec>
</sec>
<sec id="Sec8">
<title>Classification of Tweets</title>
<p id="Par33">As stated earlier, in this section we classify tweets posted during an epidemic into the following classes: (i) Symptom, (ii) Prevention, (iii) Transmission, (iv) Treatment, (v) Death report, and (vi) Non-disease. We follow a supervised classification approach, for which we need a gold standard of labeled tweets.</p>
<sec id="Sec9">
<title>Gold Standard</title>
<p id="Par34">For training the classifier, we consider 2000 randomly selected tweets (after removing duplicates and retweets) related to both events. Three human volunteers independently examine the tweets and decide whether they contribute information about the epidemic.
<xref ref-type="fn" rid="Fn1">1</xref>
We obtain unanimous agreement (i.e., all three volunteers assign the same label to a tweet) for 87% of the tweets. For the rest of the tweets, we follow the majority verdict. The non-disease category contains more tweets than the individual disease-related classes. Hence, we discard the surplus tweets in
<italic>non-disease</italic>
for tackling class imbalance. Table 
<xref rid="Tab2" ref-type="table">2</xref>
shows the number of tweets in the gold standard finally created.
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Number of tweets present in different classes</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Event</th>
<th align="left">Symptom</th>
<th align="left">Prevention</th>
<th align="left">Transmission</th>
<th align="left">Treatment</th>
<th align="left">Death report</th>
<th align="left">Non-disease</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">52</td>
<td align="left">69</td>
<td align="left">65</td>
<td align="left">59</td>
<td align="left">51</td>
<td align="left">56</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">105</td>
<td align="left">70</td>
<td align="left">77</td>
<td align="left">74</td>
<td align="left">68</td>
<td align="left">84</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
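<p>The labeling rule described above (unanimous agreement where available, majority verdict otherwise) can be sketched as follows. This is a minimal illustration and not the annotation tooling used in the study: the function name aggregate_labels is hypothetical, and returning None when no label wins a majority is an assumption, since the text does not say how such cases were handled.</p>

```python
from collections import Counter


def aggregate_labels(annotations):
    """Combine per-tweet annotator labels into one label per tweet.

    `annotations` holds one list of labels per tweet (three labels per
    tweet in the setup described in the text). Returns the final labels
    and the fraction of tweets with unanimous agreement.
    """
    final_labels = []
    unanimous = 0
    for labels in annotations:
        label, votes = Counter(labels).most_common(1)[0]
        if votes == len(labels):
            unanimous += 1
        if votes * 2 > len(labels):
            # Strict majority of annotators agree on this label.
            final_labels.append(label)
        else:
            # Assumption: no majority means the tweet is left unlabeled.
            final_labels.append(None)
    return final_labels, unanimous / len(annotations)
```

<p>For three annotators, a tweet labeled (prevention, prevention, non-disease) is assigned to prevention, while a three-way disagreement is left unlabeled under the assumption stated above.</p>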
</sec>
<sec id="Sec10">
<title>Classification Features</title>
<p id="Par36">We aim to build a classifier which can be trained over tweets posted during past disease outbreaks and can directly be used over tweets posted for future epidemics. Earlier Rudra et al. (
<xref ref-type="bibr" rid="CR31">2015</xref>
) showed that low-level lexical features are useful for developing event-independent classifiers and can outperform vocabulary-based approaches. Hence, we use a set of event-independent lexical and syntactic features for the classification task.</p>
<p id="Par37">A disease-independent classification of tweets requires lexical resources which provide domain knowledge and associated terms. In this work, we use a large medical knowledge base, the Unified Medical Language System (UMLS) (Bodenreider
<xref ref-type="bibr" rid="CR3">2004</xref>
). It comprises over 3 million concepts (virus, flu, etc.), each of which is assigned to at least one of the 134 semantic types. Next, MetaMap (Aronson
<xref ref-type="bibr" rid="CR1">2001</xref>
) is used for mapping texts to UMLS concepts. For example, if MetaMap is applied to the tweet ‘Cover your mouth and wear gloves there is a mers warning’, we get the following set of word-concept type pairs: 1. cover-
<bold>Medical device</bold>
, 2. mouth-
<bold>Body space</bold>
, 3. mers-
<bold>Disease or syndrome</bold>
, 4. gloves-
<bold>Manufactured object</bold>
, and 5. warning-
<bold>Regulatory activity</bold>
. As mentioned in Section 
<xref rid="Sec2" ref-type="sec">2</xref>
, MetaMap does not perform well in case of short, informal texts. Hence, raw tweets have to pass through some preprocessing phases so that MetaMap can be applied over processed set of tweets. The preprocessing steps are described below.
<list list-type="order">
<list-item>
<p id="Par38">We remove unnecessary words (URLs, mentions, hashtag signs, emoticons, punctuation, and other Twitter specific tags) from the tweets. We use a Twitter-specific part-of-speech (POS) tagger (Gimpel et al.
<xref ref-type="bibr" rid="CR14">2011</xref>
) to identify POS tags for each word in the tweet. Along with normal POS tags (nouns, verbs, etc.), this tagger also labels Twitter-specific elements such as emoticons, retweets, URLs, and so on.</p>
</list-item>
<list-item>
<p id="Par39">We only consider words which are formal English words and present in an English dictionary (Aspell-python). We also remove out-of-vocabulary words commonly used in social media (Maity et al.
<xref ref-type="bibr" rid="CR26">2016</xref>
).</p>
</list-item>
<list-item>
<p id="Par40">MetaMap was originally designed to work on formal medical texts. In the case of general texts (tweets), we observe that many common words (‘i’, ‘not’, ‘from’) are mapped to certain medical concepts. For example, in the tweet ‘concern over ontario patient from nigeria with flu symptoms via’, ‘from’ and ‘to’ are marked as
<italic>qualitative concept (qlco)</italic>
. Hence, we remove all the stopwords from tweets.</p>
</list-item>
</list>
</p>
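<p>The three preprocessing steps listed above can be approximated with the following sketch. It is only an illustration under stated assumptions: regular expressions stand in for the Twitter-specific POS tagger, the optional vocabulary argument stands in for the Aspell dictionary check, and the stopword list is a small hypothetical sample rather than the list used in the study.</p>

```python
import re

# Small illustrative stopword list (assumption, not the authors' list).
STOPWORDS = {
    "i", "not", "from", "to", "a", "an", "the", "is", "there",
    "and", "your", "with", "over", "via",
}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(tweet, vocabulary=None):
    """Return the cleaned tokens of a tweet before concept mapping."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)          # step 1: drop URLs
    text = MENTION_RE.sub(" ", text)      # step 1: drop user mentions
    text = text.replace("#", " ")         # step 1: drop the hashtag sign only
    text = NON_LETTER_RE.sub(" ", text)   # step 1: punctuation, digits, emoticons
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]  # step 3
    if vocabulary is not None:
        # Step 2: keep only words found in the supplied dictionary.
        tokens = [tok for tok in tokens if tok in vocabulary]
    return tokens
```

<p>Applied to a raw variant of the example tweet used earlier, the sketch keeps the content words (cover, mouth, wear, gloves, mers, warning) that would then be passed to MetaMap.</p>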
<p id="Par41">After preprocessing, tweets are passed as input to MetaMap which returns the set of tokens present in the tweet as concepts of UMLS Metathesaurus along with their corresponding semantic type. Finally, semantic types obtained from MetaMap are utilized for finding the relevant features. Table 
<xref rid="Tab3" ref-type="table">3</xref>
lists the classification features (binary).
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Lexical features used to classify tweets across different classes</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Feature</th>
<th align="left">Explanation</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Presence of</td>
<td align="left">We check if a concept (‘phsf’, ‘sosy’) related to symptoms is present in the</td>
</tr>
<tr>
<td align="left">sign/symptoms</td>
<td align="left">tweet. Expected to be higher in symptom related tweets. The semantic types</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">which indicate the presence of such term are Sign or Symptom (‘sosy’);</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Physiologic Function (‘phsf’)</td>
</tr>
<tr>
<td align="left">Presence of preventive</td>
<td align="left">Concepts related to preventive procedures (‘topp’) mostly present</td>
</tr>
<tr>
<td align="left">procedures</td>
<td align="left">in preventive category tweets</td>
</tr>
<tr>
<td align="left">Presence of anatomy</td>
<td align="left">Preventive procedures sometimes indicate taking care of certain parts of body.</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">This feature identifies the presence of terms related to body system,</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">substance, junction, body part, organ, or organ Component. Concepts like</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">‘bdsu’, ‘blor’, ‘bpoc’ are present in tweets describing anatomical structures</td>
</tr>
<tr>
<td align="left">Presence of preventive</td>
<td align="left">Terms like ‘preventive’, ‘prevention’ etc. indicate tweets containing</td>
</tr>
<tr>
<td align="left">terms</td>
<td align="left">information about preventive mechanism</td>
</tr>
<tr>
<td align="left">Presence of transmission</td>
<td align="left">Terms like ‘transmission’, ‘spread’ mostly present in tweets related to disease</td>
</tr>
<tr>
<td align="left">terms</td>
<td align="left">transmission</td>
</tr>
<tr>
<td align="left">Presence of treatment terms</td>
<td align="left">Terms like ‘treating’, ‘treatment’ mostly present in tweets related to treatment</td>
</tr>
<tr>
<td align="left">Presence of death terms</td>
<td align="left">Tweets related to dead people contain terms like ‘die’, ‘kill’, ‘death’ etc</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
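<p>To make the binary features of Table 3 concrete, the sketch below builds a seven-dimensional 0/1 vector from a tweet's tokens and the semantic-type abbreviations returned for it. It assumes the MetaMap output has already been reduced to a collection of abbreviations; the function and constant names are hypothetical, and the sets contain only the abbreviations and trigger terms spelled out in the table, which are probably a subset of the lists actually used.</p>

```python
# Semantic types and trigger terms named explicitly in Table 3.
SYMPTOM_TYPES = {"sosy", "phsf"}
PREVENTIVE_PROCEDURE_TYPES = {"topp"}
ANATOMY_TYPES = {"bdsu", "blor", "bpoc"}
PREVENTIVE_TERMS = {"preventive", "prevention"}
TRANSMISSION_TERMS = {"transmission", "spread"}
TREATMENT_TERMS = {"treating", "treatment"}
DEATH_TERMS = {"die", "kill", "death"}

FEATURE_NAMES = [
    "sign_symptom", "preventive_procedure", "anatomy",
    "preventive_term", "transmission_term", "treatment_term", "death_term",
]


def extract_features(tokens, semantic_types):
    """Return the binary feature vector in the order of FEATURE_NAMES."""
    token_set = set(tokens)
    type_set = set(semantic_types)
    checks = [
        (type_set, SYMPTOM_TYPES),
        (type_set, PREVENTIVE_PROCEDURE_TYPES),
        (type_set, ANATOMY_TYPES),
        (token_set, PREVENTIVE_TERMS),
        (token_set, TRANSMISSION_TERMS),
        (token_set, TREATMENT_TERMS),
        (token_set, DEATH_TERMS),
    ]
    # A feature fires when the observed set shares any element with its target set.
    return [int(not observed.isdisjoint(targets)) for observed, targets in checks]
```

<p>Because every feature depends only on semantic types or generic trigger terms, the resulting vector does not change with the name of the disease, which is the property the cross-domain experiments rely on.</p>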
</sec>
<sec id="Sec11">
<title>Performance</title>
<p id="Par42">We compare the performance of our proposed set of lexical features with a standard bag-of-words (BOW) model similar to that in Imran et al. (
<xref ref-type="bibr" rid="CR20">2014</xref>
), in which unigrams are used as features. We remove URLs, mentions, hashtag signs, emoticons, punctuation, stopwords, and other Twitter-specific tags from the tweets using the Twitter POS tagger (Gimpel et al.
<xref ref-type="bibr" rid="CR14">2011</xref>
).</p>
<sec id="FPar6">
<title>
<bold>Model Selection</bold>
</title>
<p id="Par43">For this experiment, we consider four state-of-the-art classification models from Scikit-learn package (Pedregosa et al.
<xref ref-type="bibr" rid="CR29">2011</xref>
): (i) Support Vector Machine (SVM) classifier with the default RBF kernel and gamma = 0.5, (ii) SVM classifier with linear kernel and
<italic>l</italic>
2 regularization, (iii) logistic regression, and (iv) Naive Bayes classifier. The SVM classifier with RBF kernel outperforms the other classification models when our proposed set of features is used for training, and the logistic regression model shows the best performance when unigrams are used as features. Hence, we use these two classification models for the rest of the study.</p>
<p id="Par44">We compare the performance of the two feature sets under two different scenarios: (i) in-domain classification, where tweets of the same disease are used to train and test the classifier using 10-fold cross-validation, and (ii) cross-domain classification, where the classifier is trained on tweets of one disease and tested on tweets of another disease. Table 
<xref rid="Tab4" ref-type="table">4</xref>
shows the accuracies of the classifier using bag-of-words model (BOW) and the proposed features (PRO) on the tweets.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>Classification accuracies of tweets, using (i) bag-of-words features (BOW), (ii) proposed features (PRO). Diagonal entries are for in-domain classification, while the non-diagonal entries are for cross-domain classification. Values in the bracket represent standard deviations in case of in-domain accuracies</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Train set</th>
<th align="left" colspan="4">Test set</th>
</tr>
<tr>
<th align="left"></th>
<th align="left" colspan="2">Ebola</th>
<th align="left" colspan="2">MERS</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">BOW</th>
<th align="left">PRO</th>
<th align="left">BOW</th>
<th align="left">PRO</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">
<bold>
<italic>84.78% (0.05)</italic>
</bold>
</td>
<td align="left">
<italic>84.02% (0.06)</italic>
</td>
<td align="left">65.69%</td>
<td align="left">
<bold>76.15%</bold>
</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">66.19%</td>
<td align="left">
<bold>74.72%</bold>
</td>
<td align="left">
<bold>
<italic>88.26% (0.07)</italic>
</bold>
</td>
<td align="left">
<italic>81.05% (0.03)</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>In-domain classification results are represented by italic entries. For each train-test pair, the accuracy of better performing system has been boldfaced</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="FPar7">
<title>
<bold>In-domain Classification</bold>
</title>
<p id="Par45">BOW model performs well in the case of in-domain classification (diagonal entries in Table 
<xref rid="Tab4" ref-type="table">4</xref>
) due to the uniform vocabulary used during a particular event. However, the performance of the proposed lexical features is broadly comparable to that of the bag-of-words model.</p>
</sec>
<sec id="FPar8">
<title>
<bold>Cross-Domain Classification</bold>
</title>
<p id="Par46">The non-diagonal entries of Table 
<xref rid="Tab4" ref-type="table">4</xref>
represent the accuracies, where the event on the left-hand side of the table is the training event and the event at the top is the test event. The proposed model performs better than the BOW model in such scenarios, since it is independent of the vocabulary of specific events. For cross-domain classification, we also measure the precision, recall, and F-score of classification for both sets of features. To account for class imbalance, we use weighted measures of precision, recall, and F-score. Table 
<xref rid="Tab5" ref-type="table">5</xref>
shows the recall and F-score for each set of features, where each row corresponds to a training event and each column group to a test event. Our proposed set of features achieves higher recall and F-score than the bag-of-words model in the cross-domain setting, which indicates that low-level lexical features are promising for classifying tweets posted during future epidemics.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Recall (F-score) of tweets, using (i) bag-of-words features (BOW), (ii) proposed features (PRO)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Train set</th>
<th align="left" colspan="4">Test set</th>
</tr>
<tr>
<th align="left"></th>
<th align="left" colspan="2">Ebola</th>
<th align="left" colspan="2">MERS</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">BOW</th>
<th align="left">PRO</th>
<th align="left">BOW</th>
<th align="left">PRO</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">
<italic>0.84(0.85)</italic>
</td>
<td align="left">
<italic>0.84(0.84)</italic>
</td>
<td align="left">0.65(0.66)</td>
<td align="left">0.76(0.76)</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">0.66(0.65)</td>
<td align="left">0.75(0.75)</td>
<td align="left">
<italic>0.88(0.88)</italic>
</td>
<td align="left">
<italic>0.81(0.81)</italic>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>In-domain classification results are represented by italic entries. For each train-test pair, the accuracy of better performing system has been boldfaced</p>
</table-wrap-foot>
</table-wrap>
</p>
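<p>The text does not spell out how the weighted measures are computed. One common reading, sketched below in plain Python, is to average the per-class scores with weights proportional to the number of true instances of each class, which is what the average='weighted' option in scikit-learn does. The function name weighted_prf is hypothetical.</p>

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F-score over all true classes."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f_score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        p_c = true_pos / n_pred if n_pred else 0.0
        r_c = true_pos / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n_true / total  # class weight = share of true instances
        precision += weight * p_c
        recall += weight * r_c
        f_score += weight * f_c
    return precision, recall, f_score
```

<p>Under this reading, weighted recall is identical to overall accuracy, which would be consistent with the recall values in Table 5 matching the accuracies in Table 4 after rounding.</p>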
</sec>
</sec>
<sec id="Sec12">
<title>Analyzing Misclassified Tweets</title>
<p id="Par47">From Table 
<xref rid="Tab4" ref-type="table">4</xref>
, it is clear that in the cross-domain scenario around 25% of the tweets are misclassified. In this part, we analyze the different kinds of errors present in the data and identify the reasons behind such misclassification. We observe that in most cases, tweets from the ‘symptom’, ‘prevention’, and ‘transmission’ classes are incorrectly tagged as ‘non-disease’ due to the absence of the features presented in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. When we train our proposed model on the Ebola dataset and test it on MERS, tweets belonging to the symptom, prevention, and disease transmission classes are misclassified as non-disease in 12%, 13%, and 8% of the cases, respectively. A few pairs of classes, such as ‘symptom’ and ‘prevention’ or ‘transmission’ and ‘prevention’, are inter-related. In these pairs, people often use information from one class to derive information for the other class. Thus, a single tweet may contain information from multiple classes. In such cases, the classifier is confused and selects a label arbitrarily. Table 
<xref rid="Tab6" ref-type="table">6</xref>
shows examples of misclassified tweets, with their true and predicted labels. In most cases, we need features that can discriminate between two closely related classes. In the future, we will try to incorporate more low-level lexical features to improve classification accuracy.
<table-wrap id="Tab6">
<label>Table 6</label>
<caption>
<p>Examples of misclassified tweets</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Tweet</th>
<th align="left">True class</th>
<th align="left">Predicted class</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Worried about the #mers #virus here are 10 ways to boost your body’s</td>
<td align="left">Prevention</td>
<td align="left">Not relevant</td>
</tr>
<tr>
<td align="left">immune system to fight disease #health</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">The truth is that #coronavirus #mers can transmit between humans we</td>
<td align="left">Prevention</td>
<td align="left">Disease</td>
</tr>
<tr>
<td align="left">think not as well as flu but protect yourself anyway wash hands 24/7</td>
<td align="left"></td>
<td align="left">transmission</td>
</tr>
<tr>
<td align="left">From on mers-cov wash your hands cover your coughs and sneezes</td>
<td align="left">Prevention</td>
<td align="left">Symptom</td>
</tr>
<tr>
<td align="left">and stay home if you are sick</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Learn more about #mers the virus that causes it how it spreads symptoms</td>
<td align="left">Symptom</td>
<td align="left">Prevention</td>
</tr>
<tr>
<td align="left">prevention tips &amp; what cdc is doing</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Wash your hands folks and keep your areas clean mers-middle east</td>
<td align="left">Prevention</td>
<td align="left">Death reports</td>
</tr>
<tr>
<td align="left">respiratory syndrome 1/3 of the people who get this dies</td>
<td align="left"></td>
<td align="left"></td>
</tr>
<tr>
<td align="left">#mers is not as contagious as the flu says #infectiousdisease expert via</td>
<td align="left">Disease</td>
<td align="left">Not relevant</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">transmission</td>
<td align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
</sec>
<sec id="Sec13">
<title>Summarization</title>
<p id="Par48">Given the tweets automatically classified into different disease classes (described in the previous section), in this section we aim to provide a cohesive summary of each class. The type of information that should be extracted from each category, and how it is presented to end-users, varies.
<xref ref-type="fn" rid="Fn2">2</xref>
For instance, in the case of the ‘symptom’ category, two lists of symptoms are required (i) positive symptoms list (i.e. actual symptoms of a disease) (ii) negative symptoms list (i.e. symptoms which are not yet confirmed as actual symptoms of a disease). However, in the case of the ‘prevention’ category, instead of generating lists, we aim to summarize prevention strategies. Next, we describe different summarization techniques followed for different categories.</p>
<sec id="Sec14">
<title>Summarizing Symptoms</title>
<p id="Par50">To automatically extract positive and negative symptoms from the tweets classified into the symptom category, we first generate a symptoms dictionary. For this purpose, we extract symptoms listed on various credible online sources like Wikipedia,
<xref ref-type="fn" rid="Fn3">3</xref>
MedicineNet,
<xref ref-type="fn" rid="Fn4">4</xref>
Healthline
<xref ref-type="fn" rid="Fn5">5</xref>
etc. Our dictionary contains around 770 symptoms.</p>
<sec id="FPar9">
<title>
<bold>Symptom Identification</bold>
</title>
<p id="Par54">Now, given a tweet
<italic>t</italic>
, we check if it contains a symptom from the symptom dictionary. If a symptom
<italic>s</italic>
is found in
<italic>t</italic>
, then there can be two possibilities:
<list list-type="order">
<list-item>
<p id="Par55">
<italic>Positive symptom</italic>
: The user who posted tweet
<italic>t</italic>
might be reporting that symptom
<italic>s</italic>
would be observed in a user if she is affected by the ongoing epidemic. E.g., ‘symptoms of MERS include
<italic>fever</italic>
and
<italic>shortness of breath</italic>
.’</p>
</list-item>
<list-item>
<p id="Par56">
<italic>Negative symptom</italic>
: The user who posted tweet
<italic>t</italic>
might be conveying that symptom
<italic>s</italic>
would not be observed if a user is affected by the ongoing epidemic. E.g., ‘#Ebola symptoms are different than upper respiratory tract pathogens,
<italic>no cough, nasal congestion</italic>
Dr. Wilson.’</p>
</list-item>
</list>
</p>
<p id="Par57">We distinguish between these two cases using the terms that have a dependency relation with the symptom term. We check if symptom
<italic>s</italic>
has a dependency (Kong et al.
<xref ref-type="bibr" rid="CR23">2014</xref>
) with any strongly negative term in the tweet
<italic>t</italic>
. Symptom
<italic>s</italic>
is a
<italic>negative symptom</italic>
of the disease if
<italic>s</italic>
has a dependency with at least one strongly negative term in
<italic>t</italic>
. If there is no such dependency with any negative term, then symptom
<italic>s</italic>
is a
<italic>positive symptom</italic>
of the disease. We use Christopher Potts’ sentiment tutorial
<xref ref-type="fn" rid="Fn6">6</xref>
to identify strongly negative terms (e.g., never, no, wont) and a Twitter dependency parser to identify the dependency relations present in a tweet. For example, for the tweet ‘CDC announces second case of MERS virus.’, the dependency tree contains the following six relations: (CDC, announces), (case, announces), (second, case), (of, case), (MERS, virus), (virus, of). Table 
<xref rid="Tab7" ref-type="table">7</xref>
shows examples of some positive and negative symptoms. After identifying both positive and negative symptoms, we rank them on the basis of their
<italic>corpus frequency</italic>
, i.e., the number of tweets in the corpus (symptom class) in which the symptom has been reported. However, the same symptom
<italic>s</italic>
might occur in multiple tweets. If a symptom
<italic>s</italic>
is found as a
<italic>positive symptom</italic>
in one tweet and also captured as a
<italic>negative symptom</italic>
in another tweet, then
<italic>s</italic>
is considered ambiguous. Next, we describe the method to deal with such ambiguous symptoms.
<table-wrap id="Tab7">
<label>Table 7</label>
<caption>
<p>Sample tweets posted during outbreak containing symptoms in positive and negative context</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Context</th>
<th align="left">Tweet</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left">#Ebola symptoms: fever, headache, muscle aches, weakness, no appetite,</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">stomach pain, vomiting, diarrhea & bleeding</td>
</tr>
<tr>
<td align="left">Positive</td>
<td align="left">RT @NTANewsNow: Ebola symptoms starts as malaria or cold then vomiting,</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">weakness, Joint & Muscle Ache, Stomach pain and Lack of Appetite</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">#Ebola symptoms are different than upper respiratory tract pathogens, no</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">cough, nasal congestion Dr. Wilson</td>
</tr>
<tr>
<td align="left">Negative</td>
<td align="left">I’ve been informed that coughing is not a symptom of Ebola</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
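<p>The polarity check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dependency pairs are written by hand rather than produced by a parser, and the names <italic>NEGATIVE_TERMS</italic> and <italic>symptom_polarity</italic>, as well as the tiny negative-term set, are assumptions made for the example.</p>
<preformat>
```python
# Hypothetical mini-lexicon; the paper takes strongly negative terms from
# Christopher Potts' sentiment tutorial.
NEGATIVE_TERMS = {"no", "not", "never", "wont"}


def symptom_polarity(symptom, dependencies):
    """Classify a symptom mention as 'negative' or 'positive'.

    `dependencies` is a list of (word, head) pairs for one tweet, assumed to
    come from a dependency parser. The symptom is negative if it shares a
    dependency edge with at least one strongly negative term.
    """
    for word, head in dependencies:
        if word == symptom and head in NEGATIVE_TERMS:
            return "negative"
        if head == symptom and word in NEGATIVE_TERMS:
            return "negative"
    return "positive"


# Hand-written edges (illustrative only) for "no cough" and
# "symptoms include fever".
negative_example = [("no", "cough"), ("cough", "symptoms")]
positive_example = [("fever", "include"), ("symptoms", "include")]

print(symptom_polarity("cough", negative_example))
print(symptom_polarity("fever", positive_example))
```
</preformat>
<p>In this sketch a single edge to a negative term is enough to flip the label, which mirrors the "at least one strongly negative term" rule stated in the text.</p>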
</sec>
<sec id="FPar10">
<title>
<bold>Removing Conflicting Symptoms</bold>
</title>
<p id="Par59">In this work, we are primarily interested in
<italic>positive symptoms</italic>
of a disease, i.e., symptoms which represent that disease. As noted earlier, ambiguous symptoms occur in both the positive and negative lists. However, the frequency of occurrence of a particular symptom
<italic>s</italic>
is not likely to be the same for both positive and negative classes. Hence, we compute the ratio of positive to negative corpus frequency of a particular ambiguous symptom
<italic>s</italic>
. If that ratio is ≤ 1, then we drop that symptom
<italic>s</italic>
from the positive list.</p>
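<p>A small sketch of this ratio-based filtering is given below, assuming each symptom mention has already been labelled as positive or negative. The function name <italic>rank_positive_symptoms</italic> and the toy mentions are hypothetical and serve only to illustrate the rule (drop a symptom when its positive-to-negative corpus frequency ratio is at most 1).</p>
<preformat>
```python
def rank_positive_symptoms(labelled_mentions):
    """Rank positive symptoms by corpus frequency and drop conflicting ones.

    `labelled_mentions` is an iterable of (tweet_id, symptom, polarity)
    triples, where polarity is 'positive' or 'negative'. Corpus frequency is
    the number of distinct tweets in which the symptom is reported.
    """
    positive_tweets = {}
    negative_tweets = {}
    for tweet_id, symptom, polarity in labelled_mentions:
        bucket = positive_tweets if polarity == "positive" else negative_tweets
        bucket.setdefault(symptom, set()).add(tweet_id)

    ranked = []
    for symptom, tweets in positive_tweets.items():
        pos = len(tweets)
        neg = len(negative_tweets.get(symptom, ()))
        # A ratio pos / neg of at most 1 is equivalent to neg >= pos.
        if neg > 0 and neg >= pos:
            continue
        ranked.append((symptom, pos))
    # Highest corpus frequency first; ties broken alphabetically.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Toy data, invented for illustration.
mentions = [
    (1, "fever", "positive"), (2, "fever", "positive"), (3, "fever", "negative"),
    (4, "cough", "positive"), (5, "cough", "negative"), (6, "cough", "negative"),
    (7, "vomiting", "positive"),
]
print(rank_positive_symptoms(mentions))
```
</preformat>
<p>Here ‘cough’ is removed because it appears more often in a negative context than in a positive one, while ‘fever’ survives despite one negative mention.</p>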
</sec>
</sec>
<sec id="Sec15">
<title>Summarizing Disease Transmissions</title>
<p id="Par60">During epidemics, vulnerable users look for information about possible disease transmission mediums so that precautionary steps can be taken. Common users and health organizations post tweets about possible transmission mediums of a disease for public awareness. It is observed that information about transmission mediums is mostly centered around keywords like ‘transmission’, ‘transmit’ etc. In this work, we use the following set of transmission related keywords: (i) transmission, (ii) transmit, (iii) transference, (iv) transferral, (v) dissemination, (vi) diffusion, (vii) emanation, (viii) channeling, (ix) spread, (x) transfer, (xi) relay.</p>
<p id="Par61">To identify informative components centered around such keywords, we explore the dependency relation among the words in a tweet using a
<italic>dependency tree</italic>
(Kong et al.
<xref ref-type="bibr" rid="CR23">2014</xref>
). A dependency tree indicates the relations among the words present in a tweet. For example, the dependency tree for the tweet ‘Ebola virus could be transmitted via infectious aerosol’ contains the following two dependency relations centered around the keyword ‘transmit’: (via, transmit), (aerosol, transmit). In general, the POS tag of every transmission medium will be ‘Noun’ (e.g., ‘aerosol’ in the previous example). Hence, we detect all nouns connected to keywords in the dependency tree within a 2-hop distance.</p>
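<p>The 2-hop noun search can be sketched as a breadth-first traversal over the dependency edges. The sketch below is illustrative only: the edges and POS tags are written by hand, the tag ‘N’ for nouns and the surface form ‘transmitted’ as keyword are assumptions, and <italic>nouns_near_keyword</italic> is a hypothetical helper name.</p>
<preformat>
```python
from collections import deque


def nouns_near_keyword(edges, pos_tags, keyword, max_hops=2):
    """Return nouns within `max_hops` of `keyword` in the dependency tree.

    `edges` is a list of (word, head) pairs and `pos_tags` maps each word to
    its POS tag; both are assumed to come from Twitter-specific tools.
    """
    # Treat the dependency tree as an undirected graph.
    graph = {}
    for word, head in edges:
        graph.setdefault(word, set()).add(head)
        graph.setdefault(head, set()).add(word)

    # Breadth-first search that stops expanding at max_hops.
    distance = {keyword: 0}
    queue = deque([keyword])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)

    return sorted(
        word for word in distance
        if word != keyword and pos_tags.get(word) == "N"
    )


# Hand-written edges for "... transmitted via infectious aerosol".
edges = [("via", "transmitted"), ("aerosol", "transmitted"), ("infectious", "aerosol")]
pos_tags = {"via": "P", "aerosol": "N", "infectious": "A", "transmitted": "V"}
print(nouns_near_keyword(edges, pos_tags, "transmitted"))
```
</preformat>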
<p id="Par62">It is observed that in some cases people post information about mediums not responsible for disease transmission. Table 
<xref rid="Tab8" ref-type="table">8</xref>
shows example tweets providing information about transmission mediums in both positive and negative contexts. To capture the actual intent of a message, we detect whether any negated context is associated with the keywords (as in symptom detection, Section 
<xref rid="Sec14" ref-type="sec">5.1</xref>
). Finally, we rank the transmission mediums based on their
<italic>corpus frequency</italic>
, i.e., the number of tweets in the transmission class in which they occur, and remove ambiguous mediums (present in both the positive and negative lists) based on the ratio of their frequencies of occurrence in positive and negative contexts (Section 
<xref rid="Sec14" ref-type="sec">5.1</xref>
).
<table-wrap id="Tab8">
<label>Table 8</label>
<caption>
<p>Sample tweets posted during outbreak containing information about transmission mediums in positive and negative context</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Context</th>
<th align="left">Tweet</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left">@USER @USER @USER I’ve also read that Ebola can spread thru airborne</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">transmission [url]</td>
</tr>
<tr>
<td align="left">Positive</td>
<td align="left">#Ebola virus could be transmitted via infectious aerosol particles</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">Idiots & liars! @USER WH briefing: “Ebola is not like the flu. #Ebola is</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<bold>not transmitted</bold>
through the air.” [url]</td>
</tr>
<tr>
<td align="left">Negative</td>
<td align="left">RT @USER: CDc: You must have personal contact to contract #Ebola. It</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">is
<bold>not transmitted</bold>
by airborn route</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec16">
<title>Summarizing Prevention Information</title>
<p id="Par63">Users vulnerable to a disease are primarily looking for preventive measures. To provide a summary of those preventive measures, we consider tweets categorized as ‘prevention’ by our classifier and some specific types of
<bold>preventive terms</bold>
which provide important information about preventive measures in epidemic scenarios: (i) therapeutic or preventive procedures, (ii) symptom words, and (iii) anatomy words (terms related to body system, substance, junction, body part, etc.). We extract these preventive terms from tweets using the MetaMap and UMLS knowledge bases. </p>
<p id="Par64">Considering that the important preventive information in an epidemic is often centered around
<bold>preventive terms</bold>
, we can achieve good coverage of preventive information in a summary by optimizing the coverage of such important preventive terms. To capture preventive terms, we extract prevention (‘Therapeutic or preventive procedure’), anatomy (‘Body location or region’, ‘Body substance’, ‘Body part, organ, or organ component’), and daily-activity-related concepts and terms using MetaMap and UMLS. The importance of a preventive term is computed based on its frequency of occurrence in the corpus, i.e., the number of times a term
<italic>t</italic>
is present in the corpus of preventive tweets.</p>
<p id="Par65">To generate a summary from the tweets in this category, we use an Integer Linear Programming (ILP)-based technique (Rudra et al.
<xref ref-type="bibr" rid="CR31">2015</xref>
) to optimize the coverage of the preventive terms. Table 
<xref rid="Tab9" ref-type="table">9</xref>
states the notations used.
<table-wrap id="Tab9">
<label>Table 9</label>
<caption>
<p>Notations used in the summarization technique</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Notation</th>
<th align="left">Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<italic>L</italic>
</td>
<td align="left">Desired summary length (number of words)</td>
</tr>
<tr>
<td align="left">
<italic>n</italic>
</td>
<td align="left">Number of tweets considered for summarization (in the</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">time window specified by user)</td>
</tr>
<tr>
<td align="left">
<italic>m</italic>
</td>
<td align="left">Number of distinct preventive terms included in the
<italic>n</italic>
tweets</td>
</tr>
<tr>
<td align="left">
<italic>i</italic>
</td>
<td align="left">Index for tweets</td>
</tr>
<tr>
<td align="left">
<italic>j</italic>
</td>
<td align="left">Index for preventive terms</td>
</tr>
<tr>
<td align="left">
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
</td>
<td align="left">Indicator variable for tweet
<italic>i</italic>
(1 if tweet
<italic>i</italic>
should be</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">included in summary, 0 otherwise)</td>
</tr>
<tr>
<td align="left">
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
</td>
<td align="left">Indicator variable for preventive term
<italic>j</italic>
</td>
</tr>
<tr>
<td align="left">
<italic>L</italic>
<italic>e</italic>
<italic>n</italic>
<italic>g</italic>
<italic>t</italic>
<italic>h</italic>
(
<italic>i</italic>
) </td>
<td align="left">Number of words present in tweet
<italic>i</italic>
</td>
</tr>
<tr>
<td align="left">Score(
<italic>j</italic>
)</td>
<td align="left">Corpus frequency (cf) score of preventive term
<italic>j</italic>
</td>
</tr>
<tr>
<td align="left">
<italic>T</italic>
<sub>
<italic>j</italic>
</sub>
</td>
<td align="left">Set of tweets where preventive term
<italic>j</italic>
is present</td>
</tr>
<tr>
<td align="left">
<italic>P</italic>
<sub>
<italic>i</italic>
</sub>
</td>
<td align="left">Set of preventive terms present in tweet
<italic>i</italic>
</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par66">The summarization is achieved by optimizing the following ILP objective function:
<disp-formula id="Equ1">
<label>1</label>
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \max\left(\lambda_{1} \cdot \sum\limits_{i = 1}^{n} x_{i} \, + \, \lambda_{2} \cdot \sum\limits_{j = 1}^{m} Score(j) \cdot y_{j}\right) $$\end{document}</tex-math>
<mml:math id="M2">
<mml:mi>max</mml:mi>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mn>.</mml:mn>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.3em"></mml:mspace>
<mml:mo>+</mml:mo>
<mml:mspace width="0.3em"></mml:mspace>
<mml:msub>
<mml:mrow>
<mml:mi>λ</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mn>.</mml:mn>
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:mi>Score</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>)</mml:mo>
<mml:mn>.</mml:mn>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:math>
<graphic xlink:href="10796_2018_9844_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
subject to the constraints
<disp-formula id="Equ2">
<label>2</label>
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{array}{@{}rcl@{}} &&\sum\limits_{i = 1}^{n} x_{i} \cdot Length(i) \leq L \end{array} $$\end{document}</tex-math>
<mml:math id="M4">
<mml:mtable class="eqnarray" columnalign="right center left">
<mml:mtr>
<mml:mtd class="eqnarray-1"></mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3">
<mml:munderover accent="false" accentunder="false">
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>⋅</mml:mo>
<mml:mi mathvariant="italic">Length</mml:mi>
<mml:mo>(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>)</mml:mo>
<mml:mo>≤</mml:mo>
<mml:mi>L</mml:mi>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="10796_2018_9844_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
<disp-formula id="Equ3">
<label>3</label>
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{array}{@{}rcl@{}} &&\sum\limits_{i\in T_{j} } x_{i} \geq y_{j}, j=[1 {\cdots} m] \end{array} $$\end{document}</tex-math>
<mml:math id="M6">
<mml:mtable class="eqnarray" columnalign="right center left">
<mml:mtr>
<mml:mtd class="eqnarray-1"></mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3">
<mml:munder>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>≥</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>⋯</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>]</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="10796_2018_9844_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
<disp-formula id="Equ4">
<label>4</label>
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{array}{@{}rcl@{}} &&\sum\limits_{j\in P_{i}} y_{j} \geq |P_{i}| \times x_{i}, i=[1 {\cdots} n] \end{array} $$\end{document}</tex-math>
<mml:math id="M8">
<mml:mtable class="eqnarray" columnalign="right center left">
<mml:mtr>
<mml:mtd class="eqnarray-1"></mml:mtd>
<mml:mtd class="eqnarray-2"></mml:mtd>
<mml:mtd class="eqnarray-3">
<mml:munder>
<mml:mrow>
<mml:mo>∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>∈</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>≥</mml:mo>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>|</mml:mo>
<mml:mo>×</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo>[</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>⋯</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>]</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
<graphic xlink:href="10796_2018_9844_Article_Equ4.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where the symbols are as explained in Table 
<xref rid="Tab9" ref-type="table">9</xref>
. The objective function considers both the number of tweets included in the summary (through the
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
variables) as well as the number of important preventive-terms (through the
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
variables) included. The constraint in Eq. 
<xref rid="Equ2" ref-type="">2</xref>
ensures that the total number of words contained in the tweets included in the summary is at most the desired length
<italic>L</italic>
(user-specified) while the constraint in Eq. 
<xref rid="Equ3" ref-type="">3</xref>
ensures that if the preventive term
<italic>j</italic>
is selected to be included in the summary, i.e., if
<italic>y</italic>
<sub>
<italic>j</italic>
</sub>
= 1, then at least one tweet in which this preventive term is present is selected. Similarly, the constraint in Eq. 
<xref rid="Equ4" ref-type="">4</xref>
ensures that if a particular tweet is selected to be included in the summary, then the preventive terms in that tweet are also selected. In this summarization process, our objective is to maximize the number of preventive terms covered rather than the number of tweets. Hence,
<italic>λ</italic>
<sub>1</sub>
and
<italic>λ</italic>
<sub>2</sub>
are set to 0 and 1 respectively.</p>
<p id="Par67">We use GUROBI Optimizer (Gurobi
<xref ref-type="bibr" rid="CR16">2015</xref>
) to solve the ILP. After solving this ILP, the set of tweets
<italic>i</italic>
such that
<italic>x</italic>
<sub>
<italic>i</italic>
</sub>
= 1 forms the summary.</p>
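<p>To make the optimization concrete, the sketch below solves a toy instance of the problem with λ<sub>1</sub> = 0 and λ<sub>2</sub> = 1. It deliberately replaces the ILP solver with an exhaustive search over tweet subsets, which is only feasible for a handful of tweets; under these weights, the best value of each <italic>y</italic><sub><italic>j</italic></sub> is 1 exactly when term <italic>j</italic> appears in a selected tweet, so the objective reduces to the total score of the covered terms. The tweets, term sets, and scores are invented for illustration, and <italic>best_summary</italic> is a hypothetical helper name.</p>
<preformat>
```python
from itertools import combinations


def best_summary(tweets, term_scores, max_words):
    """Pick the tweet subset that maximizes the score of covered terms.

    `tweets` is a list of (text, set_of_terms) pairs, `term_scores` maps each
    term to its corpus-frequency score, and `max_words` plays the role of L.
    Exhaustive search stands in for the ILP solver (toy sizes only).
    """
    best_value, best_subset = -1, ()
    indices = range(len(tweets))
    for size in range(len(tweets) + 1):
        for subset in combinations(indices, size):
            # Length constraint: total words in chosen tweets must fit in L.
            length = sum(len(tweets[i][0].split()) for i in subset)
            if length > max_words:
                continue
            # A term counts once, however many chosen tweets contain it.
            covered = set().union(*(tweets[i][1] for i in subset))
            value = sum(term_scores[term] for term in covered)
            if value > best_value:
                best_value, best_subset = value, subset
    return best_value, [tweets[i][0] for i in best_subset]


# Toy instance, invented for illustration.
tweets = [
    ("wash your hands often", {"wash", "hands"}),
    ("wash hands and cover coughs", {"wash", "hands", "coughs"}),
    ("stay home if sick", {"home"}),
]
term_scores = {"wash": 3, "hands": 3, "coughs": 1, "home": 2}
print(best_summary(tweets, term_scores, max_words=9))
```
</preformat>
<p>With a budget of nine words, the search prefers the second and third tweets: the first tweet adds no new preventive term once the second is selected, so spending words on it would not increase the objective.</p>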
</sec>
<sec id="Sec17">
<title>Summarizing Death Reports</title>
<p id="Par68">During such epidemics, socio-political matters arise alongside health-related issues: travelers from foreign nations are also affected, and local governments sometimes have to arrange the necessary equipment for their treatment as well as send them back to their countries. Local residents suffering from the epidemic also need support from government and health agencies. Under such constraints, governments generally keep track of the number of people who are dead or under treatment. In this part, we extract such information snippets from a large set of tweets, which may help governments get a quick snapshot of the situation. </p>
<p id="Par69">Primarily, we observe that such information is centered around keywords like ‘died’, ‘killed’, ‘dead’, ‘death’, ‘expire’, ‘demise’ etc. Table 
<xref rid="Tab10" ref-type="table">10</xref>
shows some examples of the tweets present in the ‘death reports’ class. While prior work (Rudra et al.
<xref ref-type="bibr" rid="CR31">2015</xref>
) considered all nouns and verbs as content words, in reality
<italic>not</italic>
all such words present in a tweet are linked to health related events. Hence, in the present work, we identify the keywords for the ‘death reports’ class from a manually annotated corpus. As illustrated in Table 
<xref rid="Tab10" ref-type="table">10</xref>
, tweets contain location-wise information about dead people. Hence, it is necessary to capture location information in the final summary. For summarization of death reports, we follow the same ILP framework proposed in Section 
<xref rid="Sec16" ref-type="sec">5.3</xref>
but instead of optimizing
<bold>preventive terms</bold>
, here we optimize the coverage of
<bold>death related terms</bold>
. We consider numerals (e.g., number of casualties), keywords related to death reports, and location information as
<bold>death related terms</bold>
. We use a Twitter-specific part-of-speech (POS) tagger (Gimpel et al.
<xref ref-type="bibr" rid="CR14">2011</xref>
) to identify POS tags for each word in the tweet. From these POS tags we select numerals for the summarization. We collect keywords related to death reports from manually annotated tweets. To identify location information, we use various online sources.
<xref ref-type="fn" rid="Fn7">7</xref>
Finally, the ILP method maximizes the coverage of death related terms.
<table-wrap id="Tab10">
<label>Table 10</label>
<caption>
<p>Sample tweets posted during outbreak containing information about people who were killed or died</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left">As of Oct. 15th 2014 CDC numbers for #Ebola are 8997 total cases,</td>
</tr>
<tr>
<td align="left">5006 laboratory-confirmed cases, and 4493 deaths in total</td>
</tr>
<tr>
<td align="left">RT @USER: New WHO numbers on #Ebola outbreak in 3 West</td>
</tr>
<tr>
<td align="left">African countries: 1440 ill including 826 deaths. (As of 7/30)</td>
</tr>
<tr>
<td align="left">#Ebola has infected almost 10,000 people this year, mostly in Sierra</td>
</tr>
<tr>
<td align="left">Leone, Guinea and Liberia, killing about 4900</td>
</tr>
<tr>
<td align="left">RT @USER: #Ebola: As of 4 Aug 2014, countries have reported</td>
</tr>
<tr>
<td align="left">1711 cases (1070 conf, 436 probable, 205 susp), incl 932 deaths</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
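<p>The three kinds of death related terms can be collected from a tweet roughly as follows. This is a simplified sketch under several assumptions: a regular expression stands in for the POS tagger when finding numerals, the keyword set extends the examples given in the text with the inflected forms ‘deaths’ and ‘killing’, and the location list is a tiny hypothetical gazetteer. The name <italic>death_related_terms</italic> is also hypothetical.</p>
<preformat>
```python
import re

# Illustrative keyword set; the paper derives its keywords from manually
# annotated tweets.
DEATH_KEYWORDS = {"died", "killed", "dead", "death", "deaths", "killing"}
# Hypothetical gazetteer; the paper uses online sources for locations.
LOCATIONS = {"guinea", "liberia", "sierra leone"}


def death_related_terms(tweet):
    """Extract numerals, death keywords, and locations from one tweet."""
    text = tweet.lower()
    # Regex stand-in for numeral POS tags: digits with optional commas.
    numerals = re.findall(r"\d[\d,]*", text)
    tokens = set(re.findall(r"[a-z]+", text))
    keywords = sorted(tokens.intersection(DEATH_KEYWORDS))
    locations = sorted(place for place in LOCATIONS if place in text)
    return {"numerals": numerals, "keywords": keywords, "locations": locations}


sample = ("#Ebola has infected almost 10,000 people this year, mostly in "
          "Sierra Leone, Guinea and Liberia, killing about 4900")
print(death_related_terms(sample))
```
</preformat>
<p>The extracted terms would then take the place of the preventive terms in the ILP formulation, so that the summary favors tweets carrying counts, death keywords, and locations.</p>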
</sec>
<sec id="Sec18">
<title>Summarizing Treatment Information</title>
<p id="Par71">Users who are already affected by the disease look for information about necessary medicines, treatment centers, etc. For summarizing this information, we focus on tweets categorized as ‘treatment’ by our classifier and some specific types of
<bold>treatment terms</bold>
which provide important information about recovery procedures in epidemic scenarios: (i) clinical drug and (ii) pharmacologic substance (obtained from MetaMap and UMLS). Table 
<xref rid="Tab11" ref-type="table">11</xref>
provides examples of tweets containing treatment or recovery information.
<table-wrap id="Tab11">
<label>Table 11</label>
<caption>
<p>Sample tweets posted during outbreak containing recovery information</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left">Fujifilm Drug Eyed As Possible Treatment For Ebola Virus</td>
</tr>
<tr>
<td align="left">@USER Guarded optimism - use of #HIV antiviral to treat #ebola.</td>
</tr>
<tr>
<td align="left">FDA-approved genital warts drug could treat #MERS</td>
</tr>
<tr>
<td align="left">RT @USER: DNA vaccine demonstrates potential to prevent and treat</td>
</tr>
<tr>
<td align="left">deadly MERS coronavirus: Inovio Pharmaceuticals</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par72">Considering that the important recovery information in an epidemic is often centered around treatment terms, a good coverage of recovery information can be achieved by optimizing the coverage of important
<bold>treatment terms</bold>
. The importance of a treatment term is computed based on its frequency of occurrence in the corpus, i.e., the number of times a treatment term
<italic>t</italic>
occurs in the corpus of treatment related tweets. For summarization of treatment information, we follow the same ILP framework proposed in Section 
<xref rid="Sec16" ref-type="sec">5.3</xref>
but instead of optimizing preventive terms, here we optimize the coverage of
<bold>treatment related terms</bold>
. </p>
<p id="Par73">We term our proposed MEDical dictionary based tweet SUMmarization approach as
<bold>MEDSUM</bold>
. In the next section, we evaluate the performance of our proposed summarization models.</p>
</sec>
</sec>
<sec id="Sec19">
<title>Experimental Results</title>
<p id="Par74">In this section, we evaluate the performance of our proposed summarization techniques for different information classes (symptom, transmission, prevention, death information, treatment).</p>
<sec id="Sec20">
<title>Evaluation of Symptoms of a Disease</title>
<p id="Par75">In Section 
<xref rid="Sec14" ref-type="sec">5.1</xref>
, we propose an algorithm to identify the symptoms of a disease. We need a gold standard list of symptoms to check the accuracy of our method. We extract the actual symptoms of a disease by using online sources (World Health Organization (WHO)
<xref ref-type="bibr" rid="CR38">2014</xref>
; Centers for Disease Control and Prevention
<xref ref-type="bibr" rid="CR4">2014</xref>
) and compare the output of our algorithm with the actual symptoms to compute the precision score. The numbers of actual symptoms for Ebola and MERS are 20 and 5 respectively. Hence, our proposed method also extracts 20 and 5 symptoms for Ebola and MERS respectively.</p>
<p id="Par76">In the case of Ebola, we observe that thirteen of the twenty extracted symptoms are present in the gold standard list of symptoms. Three of the remaining seven symptoms are synonyms of some original symptom (present in the list of thirteen symptoms). Similarly, in the case of MERS, three out of five symptoms are present in the gold standard list. Among the remaining two symptoms, one is a synonym of some original symptom. Overall, our proposed method is able to extract sixteen and four original symptoms for Ebola and MERS respectively. Table 
<xref rid="Tab12" ref-type="table">12</xref>
shows the precision and recall of our proposed approach.
<table-wrap id="Tab12">
<label>Table 12</label>
<caption>
<p>Precision and recall of our symptom identification method</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Disease</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">0.80</td>
<td align="left">0.65</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">0.80</td>
<td align="left">0.60</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par77">We observe that the missed symptoms (seven for Ebola and two for MERS) were identified at later stages of the disease and are not available in the tweets. Hence, we are not able to capture all the relevant symptoms, but the symptoms extracted from the tweets are able to give users an initial indication of the disease.</p>
</sec>
<sec id="Sec21">
<title>Evaluation of the Transmission Medium</title>
<p id="Par78">In Section 
<xref rid="Sec15" ref-type="sec">5.2</xref>
, we showed that users post information about both kinds of mediums i.e., mediums responsible for transmission (positive mediums) and mediums not responsible for transmission of the disease (negative mediums). Here, we are interested in positive transmission mediums i.e. mediums responsible for disease propagation. We extract the actual transmission mediums for both the diseases from online sources (World Health Organization (WHO)
<xref ref-type="bibr" rid="CR38">2014</xref>
; Centers for Disease Control and Prevention
<xref ref-type="bibr" rid="CR4">2014</xref>
) and compare the output of our algorithm with the actual transmission mediums to compute the precision and recall. We have collected fourteen and twelve transmission mediums for Ebola and MERS respectively. Table 
<xref rid="Tab13" ref-type="table">13</xref>
shows the precision and recall of the proposed algorithm for the top 10 and top 20 transmission mediums.
<table-wrap id="Tab13">
<label>Table 13</label>
<caption>
<p>Precision and recall of our transmission mediums detection method</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Disease</th>
<th align="left">#Mediums</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left"></td>
<td align="left">10</td>
<td align="left">0.70</td>
<td align="left">0.53</td>
</tr>
<tr>
<td align="left">Ebola</td>
<td align="left">20</td>
<td align="left">0.65</td>
<td align="left">0.92</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">10</td>
<td align="left">0.50</td>
<td align="left">0.42</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">20</td>
<td align="left">0.40</td>
<td align="left">0.67</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p id="Par79">It is clear from Table 
<xref rid="Tab13" ref-type="table">13</xref>
that recall increases as more transmission mediums are considered, while precision goes down. However, many transmission mediums were identified at later stages and are not present in tweets posted during these epidemics. Still, the extracted list can provide a general overview of possible transmission mediums to vulnerable people.</p>
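<p>The precision and recall of a ranked list cut at the top <italic>k</italic> entries can be computed as in the sketch below. The ranked list and gold set are invented toy values, not the experimental data, and <italic>precision_recall_at_k</italic> is a hypothetical helper name; the sketch only illustrates why recall can rise while precision falls as <italic>k</italic> grows.</p>
<preformat>
```python
def precision_recall_at_k(ranked, gold, k):
    """Precision and recall of the top-k items of `ranked` against `gold`."""
    top = ranked[:k]
    hits = sum(1 for item in top if item in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy values, invented for illustration only.
ranked_mediums = ["blood", "saliva", "air", "water", "sweat"]
gold_mediums = {"blood", "saliva", "sweat", "semen"}

print(precision_recall_at_k(ranked_mediums, gold_mediums, 2))
print(precision_recall_at_k(ranked_mediums, gold_mediums, 5))
```
</preformat>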
</sec>
<sec id="Sec22">
<title>Evaluation of Prevention, Treatment Mechanisms and Death Reports</title>
<p id="Par80">In the case of symptom and transmission, we extract a ranked list of words and phrases from the tweets of the corresponding classes. On the other hand, we propose an ILP based summarization scheme for the prevention, death report, and treatment categories. This method selects a set of tweets as a representative summary of the corresponding class. To measure the quality of a system-generated summary, we have to prepare ground truth summaries and compare the system summaries with them.</p>
<sec id="Sec23">
<title>Experimental Settings</title>
<p id="Par81">In the next part, we explain the baseline, evaluation criteria and results for each of the three information classes (prevention, death report, and treatment).</p>
<sec id="FPar11">
<title>
<bold>Preparing Ground-Truth Summaries</bold>
</title>
<p id="Par82">For both datasets and each of the information classes, three human volunteers (the same as those involved in the classification stage) individually prepare summaries of length 200 words from the tweets of the corresponding class. To prepare the final ground truth summary of a particular disease and particular class, we first choose those tweets which are included in the individual summaries of all the volunteers, followed by those which are included by the majority of the volunteers. Thus, we create a single ground truth summary of 200 words for each information class, for each dataset.</p>
</sec>
<sec id="FPar12">
<title>
<bold>Baseline</bold>
</title>
<p id="Par83">We compare the performance of our proposed summarization technique with COWTS, a disaster-specific real-time summarization technique (Rudra et al.
<xref ref-type="bibr" rid="CR31">2015</xref>
).</p>
</sec>
<sec id="FPar13">
<title>
<bold>Evaluation Metric</bold>
</title>
<p id="Par84">We use the standard ROUGE (Lin
<xref ref-type="bibr" rid="CR24">2004</xref>
) metric for evaluating the quality of the generated summaries. Due to the informal nature of tweets, we consider the
<italic>recall and F-score</italic>
of the ROUGE-1 variant. Formally, ROUGE-1 recall is the unigram recall between a candidate (system) summary and a reference summary, i.e., the number of reference-summary unigrams that appear in the candidate summary, normalized by the number of unigrams in the reference summary. Similarly, ROUGE-1 precision is the unigram precision between a candidate summary and a reference summary, i.e., the same overlap count normalized by the number of unigrams in the candidate summary. Finally, the F-score is computed as the harmonic mean of recall and precision.</p>
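<p>The definitions above can be made concrete with a minimal sketch. It is not the official ROUGE toolkit: the function name <monospace>rouge1</monospace> is hypothetical, tokenization is plain whitespace splitting, repeated unigrams are matched with clipped counts, and the stemming and stopword options mentioned in the table captions are not applied.</p>
<preformat>
```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (recall, precision, f_score) over unigrams with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: a unigram is matched at most as often as it occurs
    # in both the candidate and the reference.
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    # Harmonic mean of recall and precision.
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score
```
</preformat>
<p>For a five-word candidate that shares four unigrams with a seven-word reference, this gives a recall of 4/7, a precision of 4/5, and an F-score of 2/3.</p>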
<p id="Par85">Next, we show the performance of our proposed method for each of the information classes.</p>
</sec>
</sec>
<sec id="Sec24">
<title>Performance Comparison</title>
<sec id="FPar14">
<title>
<bold>Disease Prevention</bold>
</title>
<p id="Par86">Table 
<xref rid="Tab14" ref-type="table">14</xref>
reports the ROUGE-1 recall and F-scores of both algorithms. MEDSUM performs better than COWTS on both datasets, because disaster-specific content words are not able to capture preventive information during a disease outbreak.
<table-wrap id="Tab14">
<label>Table 14</label>
<caption>
<p>Comparison of ROUGE-1 recall and F-scores (with Twitter-specific tags, emoticons, hashtags, mentions, and URLs removed, and with the standard ROUGE stemming (-m) and stopword (-s) options) for MEDSUM (the proposed methodology) and the baseline method COWTS for the prevention class</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Event</th>
<th align="left" colspan="2">MEDSUM</th>
<th align="left" colspan="2">COWTS</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">Recall</th>
<th align="left">F-score</th>
<th align="left">Recall</th>
<th align="left">F-score</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">
<bold>0.4771</bold>
</td>
<td align="left">
<bold>0.5195</bold>
</td>
<td align="left">0.4575</td>
<td align="left">0.5109</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">
<bold>0.4898</bold>
</td>
<td align="left">
<bold>0.5393</bold>
</td>
<td align="left">0.4761</td>
<td align="left">0.4811</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>For each evaluation metric, the result of the better-performing system is boldfaced</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="FPar15">
<title>
<bold>Death Report</bold>
</title>
<p id="Par87">Table 
<xref rid="Tab15" ref-type="table">15</xref>
reports the ROUGE-1 recall and F-scores of both algorithms. Our proposed method performs at least as well as COWTS on every metric (the two methods tie on recall for Ebola), because disease-related keywords capture more specific death-related information than disaster-specific content words.
<table-wrap id="Tab15">
<label>Table 15</label>
<caption>
<p>Comparison of ROUGE-1 recall and F-scores (with Twitter-specific tags, emoticons, hashtags, mentions, and URLs removed, and with the standard ROUGE stemming (-m) and stopword (-s) options) for MEDSUM (the proposed methodology) and the baseline method COWTS for death reports</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Event</th>
<th align="left" colspan="2">MEDSUM</th>
<th align="left" colspan="2">COWTS</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">Recall</th>
<th align="left">F-score</th>
<th align="left">Recall</th>
<th align="left">F-score</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">
<bold>0.4961</bold>
</td>
<td align="left">
<bold>0.4980</bold>
</td>
<td align="left">0.4961</td>
<td align="left">0.4942</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">
<bold>0.3862</bold>
</td>
<td align="left">
<bold>0.3758</bold>
</td>
<td align="left">0.3448</td>
<td align="left">0.3322</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>For each evaluation metric, the result of the better-performing system is boldfaced</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="FPar16">
<title>
<bold>Disease Treatment</bold>
</title>
<p id="Par88">Table 
<xref rid="Tab16" ref-type="table">16</xref>
reports the ROUGE-1 recall and F-scores of both algorithms. MEDSUM outperforms COWTS on both datasets; covering treatment-related information such as drugs and medicines leads to better summaries.
<table-wrap id="Tab16">
<label>Table 16</label>
<caption>
<p>Comparison of ROUGE-1 recall and F-scores (with Twitter-specific tags, emoticons, hashtags, mentions, and URLs removed, and with the standard ROUGE stemming (-m) and stopword (-s) options) for MEDSUM (the proposed methodology) and the baseline method COWTS for the treatment class</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Event</th>
<th align="left" colspan="2">MEDSUM</th>
<th align="left" colspan="2">COWTS</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">Recall</th>
<th align="left">F-score</th>
<th align="left">Recall</th>
<th align="left">F-score</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">
<bold>0.4803</bold>
</td>
<td align="left">
<bold>0.4621</bold>
</td>
<td align="left">0.3858</td>
<td align="left">0.3525</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">
<bold>0.6517</bold>
</td>
<td align="left">
<bold>0.5983</bold>
</td>
<td align="left">0.4642</td>
<td align="left">0.4244</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>For each evaluation metric, the result of the better-performing system is boldfaced</p>
</table-wrap-foot>
</table-wrap>
</p>
<p id="Par89">In general, we observe that, during epidemics, health-related informative words achieve better information coverage than disaster-specific words.</p>
<p id="Par90">Further, we perform a statistical t-test over the six (3 (<italic>#classes</italic>) × 2 (<italic>#datasets</italic>)) ROUGE-1 F-scores (significance level 0.10) to check the statistical significance of the improvement of MEDSUM over COWTS. The improvement appears to be statistically significant (the p-value is 0.0552).</p>
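<p>The text does not state which variant of the t-test was used. The sketch below assumes a paired, two-tailed test over the six F-score pairs reported in Tables 14 to 16; the function name <monospace>paired_t_statistic</monospace> is hypothetical, and the critical values are standard Student's t table entries for 5 degrees of freedom. Under that assumption, the statistic falls between the critical values for the 0.10 and 0.05 levels, which is consistent with the reported p-value.</p>
<preformat>
```python
import math
from statistics import mean, stdev

# ROUGE-1 F-scores copied from Tables 14-16, ordered as
# (prevention, death report, treatment) x (Ebola, MERS).
MEDSUM = [0.5195, 0.5393, 0.4980, 0.3758, 0.4621, 0.5983]
COWTS = [0.5109, 0.4811, 0.4942, 0.3322, 0.3525, 0.4244]


def paired_t_statistic(xs, ys):
    """Return (t, degrees_of_freedom) for a paired t-test on xs - ys."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    # stdev() is the sample standard deviation (divides by n - 1).
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


t_value, dof = paired_t_statistic(MEDSUM, COWTS)
# Two-tailed critical values of Student's t for 5 degrees of freedom
# (standard table values): 2.015 at alpha = 0.10 and 2.571 at alpha = 0.05.
significant_at_010 = t_value > 2.015
significant_at_005 = t_value > 2.571
print(f"t = {t_value:.3f}, dof = {dof}")
```
</preformat>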
</sec>
</sec>
</sec>
</sec>
<sec id="Sec25">
<title>Discussion</title>
<p id="Par91">As stated earlier, the primary objective of this work is to automatically extract and summarize information from microblog communications during epidemics to assist different stakeholders. In Section 
<xref rid="Sec13" ref-type="sec">5</xref>
, we have proposed different summarization techniques for different information classes such as ‘symptom’, ‘prevention’, and ‘treatment’. Next, we discuss how this information helps primary and secondary health care services.</p>
<sec id="FPar17">
<title>
<bold>Vulnerable Population</bold>
</title>
<p id="Par92">Summarizing information for the ‘symptom’, ‘prevention’, and ‘transmission’ classes assists vulnerable end-users and primary health care services. These communities are vulnerable to the disease, and precautionary steps are extremely helpful in restricting its further spread. For example, if people are aware of the possible transmission mediums of the disease (human-to-human, animal-to-human, air, aerosol, etc.), they can avoid those possibilities and take relevant preventive measures.</p>
</sec>
<sec id="FPar18">
<title>
<bold>Affected Population</bold>
</title>
<p id="Par93">The affected community mostly looks for treatment-related information such as hospitals, drugs, and medicines. In Section 
<xref rid="Sec18" ref-type="sec">5.5</xref>
, we specifically tried to maximize the coverage of such information via an ILP approach. This kind of information helps secondary health care services, where patients are being treated.</p>
</sec>
<sec id="FPar19">
<title>
<bold>Health Organizations</bold>
</title>
<p id="Par94">Governments and health-related organizations (WHO, CDC) look for information about deceased or affected people. Based on this kind of information, they can decide whether medical response teams, new treatment centers, etc. are necessary in certain regions.</p>
</sec>
<sec id="FPar20">
<title>
<bold>Effect of Misclassification on Summarization</bold>
</title>
<p id="Par95">As reported in Section 
<xref rid="Sec8" ref-type="sec">4</xref>
, our proposed classifier achieves around 80% accuracy in the in-domain scenario and 75% accuracy in the cross-domain scenario (i.e., 25% of the tweets are classified wrongly). After classification, our proposed summarization framework summarizes the tweets present in the different disease-related classes such as ‘symptoms’ and ‘prevention’. In this part, we analyze the effect of misclassification on the summarization output. During summarization, the ILP framework tries to maximize the coverage of the class-specific terms that represent a particular class. For example, in the prevention category, the ILP framework tries to maximize
<italic>prevention related terms</italic>
. If a prevention-related tweet is misclassified, then the important terms it contains are also wasted, because such terms are not relevant to other classes (symptom, treatment, etc.). Here, we measure the fraction of terms lost due to misclassification. For this analysis, ground truth is required; hence, we measure these values over the manually annotated ground truth data (Section 
<xref rid="Sec8" ref-type="sec">4</xref>
). We consider two different cross-domain scenarios where the model is trained over Ebola and tested over MERS and vice versa. Table 
<xref rid="Tab17" ref-type="table">17</xref>
shows the fraction of terms missed for the symptom, prevention, and treatment classes for both Ebola and MERS. For MERS, we lose around 6–8% of the terms for the symptom and prevention classes and around 21% for the treatment class. Similarly, for Ebola, around 17%, 28%, and 35% of the terms are lost due to misclassification for the symptom, prevention, and treatment classes, respectively. Overall, misclassification has an impact on the summarization output. In future, we will incorporate other distinguishing low-level features to reduce the misclassification rate and improve the performance of the classification-summarization framework.
<table-wrap id="Tab17">
<label>Table 17</label>
<caption>
<p>Percentage of class-specific terms covered and missed in the symptom, prevention, and treatment classes for both Ebola and MERS</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Event</th>
<th align="left" colspan="2">Symptom</th>
<th align="left" colspan="2">Prevention</th>
<th align="left" colspan="2">Treatment</th>
</tr>
<tr>
<th align="left"></th>
<th align="left">Covered</th>
<th align="left">Missed</th>
<th align="left">Covered</th>
<th align="left">Missed</th>
<th align="left">Covered</th>
<th align="left">Missed</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ebola</td>
<td align="left">82.35</td>
<td align="left">17.65</td>
<td align="left">71.43</td>
<td align="left">28.57</td>
<td align="left">65</td>
<td align="left">35</td>
</tr>
<tr>
<td align="left">MERS</td>
<td align="left">94.44</td>
<td align="left">5.56</td>
<td align="left">91.67</td>
<td align="left">8.33</td>
<td align="left">78.57</td>
<td align="left">21.43</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
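<p>One way to compute the covered and missed percentages is sketched below. The function name <monospace>term_coverage</monospace>, the whitespace tokenization, and the rule that a term counts as missed only when every gold-class tweet containing it is misclassified are all assumptions made for illustration; the text does not spell out the exact counting procedure.</p>
<preformat>
```python
def term_coverage(gold_tweets, predicted_labels, target_class, class_terms):
    """Return (covered_pct, missed_pct) of class-specific terms.

    gold_tweets: list of (text, gold_label) pairs from annotated data.
    predicted_labels: classifier output, aligned with gold_tweets.
    class_terms: set of lowercase terms considered specific to target_class.
    """
    available, retained = set(), set()
    for (text, gold), predicted in zip(gold_tweets, predicted_labels):
        if gold != target_class:
            continue
        terms = class_terms.intersection(text.lower().split())
        available |= terms
        if predicted == target_class:
            retained |= terms  # the tweet reaches the right summarizer
    if not available:
        return 0.0, 0.0
    covered = 100.0 * len(retained) / len(available)
    return covered, 100.0 - covered
```
</preformat>
<p>If three symptom terms occur in the annotated symptom tweets and one of them appears only in a misclassified tweet, the sketch reports roughly 66.67% covered and 33.33% missed.</p>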
</sec>
<sec id="FPar21">
<title>
<bold>Time Taken for Summarization</bold>
</title>
<p id="Par96">During an epidemic, it is necessary to summarize information in real time because time is critical in such scenarios. Hence, we analyze the execution time of the various summarization approaches. For symptom and disease transmission, our proposed method takes around 172 and 257 seconds on average (over Ebola and MERS), respectively, to generate summaries. For prevention, treatment, and death reports, the proposed method takes around 7.39, 12.57, and 9.31 seconds on average, respectively. Extraction of symptoms and transmission mediums takes more time due to parsing overhead; still, the method is able to extract information in close to real time.</p>
<p id="Par97">In this work, we observe that information is centered around a set of health-related terms, and we try to maximize the coverage of these terms in the final summary. We also measure how the running time varies with the number of tweets. We consider the first 10,000 tweets from the death report class of MERS (around 14,000 tweets) and measure the running time at ten equally spaced breakpoints (1000, 2000, ⋯, 10000). From Fig. 
<xref rid="Fig2" ref-type="fig">2</xref>
, we can observe that the running time increases more or less linearly with the number of tweets. Moreover, the terms that carry information during catastrophes or epidemics are few in number and grow slowly compared to other real-life events such as sports, politics, and movies (Rudra et al.
<xref ref-type="bibr" rid="CR31">2015</xref>
). Hence, our proposed method is scalable and able to provide summaries in real time over a large number of disease-related tweets.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Variation of running time with number of tweets</p>
</caption>
<graphic xlink:href="10796_2018_9844_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p id="Par98">As most of the information during an epidemic is centered around some specific terms (prevention terms, drugs, treatment concepts), our proposed framework essentially tries to maximize the coverage of these terms in the final summary. We believe that this framework may be extended to other crisis scenarios. However, health-related terms (prevention terms, drugs, treatment concepts) will not work in those cases; instead, we would have to identify the terms that are capable of covering most of the important information in those other kinds of crisis scenarios.</p>
</sec>
</sec>
<sec id="Sec26">
<title>Conclusion</title>
<p id="Par99">Sudden disease outbreaks bring challenges for vulnerable and affected communities. They seek answers to their apprehensions: what are the symptoms of the disease, the preventive measures, and the treatment strategies? Health organizations also look for situational updates from the affected population to prepare a response. In this work, we target three communities: vulnerable people, affected people, and health organizations. To provide precise and timely information to these communities, we have presented a classification-summarization approach to extract useful information from a microblogging platform during outbreaks. The proposed classification approach uses low-level lexical class-specific features to effectively categorize raw Twitter messages. We developed a domain-independent classifier which performs better than the domain-dependent bag-of-words technique. Furthermore, various disease-category-specific summarization approaches have been proposed. Information posted on Twitter about, for example, symptoms often seems ambiguous to automatic information extractors. To deal with this issue, we generate separate lists representing positive and negative information. We make use of ILP techniques to generate 200-word summaries for some categories. Extensive experimentation conducted on real-world Twitter datasets from the Ebola and MERS outbreaks shows the effectiveness of the proposed approach. In future, we aim to deploy the system so that it can be practically used for any future epidemic.</p>
</sec>
</body>
<back>
<fn-group>
<fn id="Fn1">
<label>1</label>
<p id="Par35">All volunteers are regular users of Twitter and have a good knowledge of the English language.</p>
</fn>
<fn id="Fn2">
<label>2</label>
<p id="Par49">
<ext-link ext-link-type="uri" xlink:href="https://www.cdc.gov/coronavirus/mers/about/symptoms.html">https://www.cdc.gov/coronavirus/mers/about/symptoms.html</ext-link>
,
<ext-link ext-link-type="uri" xlink:href="https://www.cdc.gov/coronavirus/mers/about/prevention.html">https://www.cdc.gov/coronavirus/mers/about/prevention.html</ext-link>
</p>
</fn>
<fn id="Fn3">
<label>3</label>
<p id="Par51">
<ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/wiki/List_of_medical_symptoms">https://en.wikipedia.org/wiki/List_of_medical_symptoms</ext-link>
</p>
</fn>
<fn id="Fn4">
<label>4</label>
<p id="Par52">
<ext-link ext-link-type="uri" xlink:href="http://www.medicinenet.com/symptoms_and_signs/alpha_a.htm">http://www.medicinenet.com/symptoms_and_signs/alpha_a.htm</ext-link>
</p>
</fn>
<fn id="Fn5">
<label>5</label>
<p id="Par53">
<ext-link ext-link-type="uri" xlink:href="http://www.healthline.com/directory/symptoms">http://www.healthline.com/directory/symptoms</ext-link>
</p>
</fn>
<fn id="Fn6">
<label>6</label>
<p id="Par58">
<ext-link ext-link-type="uri" xlink:href="http://sentiment.christopherpotts.net/lingstruc.html">http://sentiment.christopherpotts.net/lingstruc.html</ext-link>
</p>
</fn>
<fn id="Fn7">
<label>7</label>
<p id="Par70">
<ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/wiki/Lists_of_cities_in_Africa">https://en.wikipedia.org/wiki/Lists_of_cities_in_Africa</ext-link>
,
<ext-link ext-link-type="uri" xlink:href="https://en.wikipedia.org/wiki/Middle_East">https://en.wikipedia.org/wiki/Middle_East</ext-link>
</p>
</fn>
<fn>
<p>This work is an extended version of the short paper: Rudra et al., Classifying Information from Microblogs during Epidemics, Proceedings of the 7th ACM International Conference on Digital Health, 2017.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>K. Rudra was supported by a fellowship from Tata Consultancy Services.</p>
</ack>
<notes>
<title>Compliance with Ethical Standards</title>
<notes notes-type="COI-statement">
<title>Competing interests</title>
<p id="Par100">The authors have no competing interests related to this paper.</p>
</notes>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<mixed-citation publication-type="other">Aronson, A.R. (2001). Effective mapping of biomedical text to the umls metathesaurus: the metamap program. In
<italic>Proceedings of the AMIA symposium (p. 17). American Medical Informatics Association</italic>
.</mixed-citation>
</ref>
<ref id="CR2">
<mixed-citation publication-type="other">Aspell-python. (2011). Python wrapper for aspell (C extension and python version).
<ext-link ext-link-type="uri" xlink:href="https://github.com/WojciechMula/aspell-python">https://github.com/WojciechMula/aspell-python</ext-link>
.</mixed-citation>
</ref>
<ref id="CR3">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bodenreider</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>The unified medical language system (umls): integrating biomedical terminology</article-title>
<source>Nucleic Acids Research</source>
<year>2004</year>
<volume>32</volume>
<issue>suppl 1</issue>
<fpage>D267</fpage>
<lpage>D270</lpage>
<pub-id pub-id-type="doi">10.1093/nar/gkh061</pub-id>
<pub-id pub-id-type="pmid">14681409</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<mixed-citation publication-type="other">Centers for Disease Control and Prevention. (2014).
<ext-link ext-link-type="uri" xlink:href="https://www.cdc.gov/coronavirus/mers/">https://www.cdc.gov/coronavirus/mers/</ext-link>
.</mixed-citation>
</ref>
<ref id="CR5">
<mixed-citation publication-type="other">De Choudhury, M. (2015). Anorexia on tumblr: a characterization study. In
<italic>Proceedings of the 5th international conference on digital health 2015 (pp. 43–50). ACM</italic>
.</mixed-citation>
</ref>
<ref id="CR6">
<mixed-citation publication-type="other">de Quincey, E., Kyriacou, T., Pantin, T. (2016). # Hayfever; a longitudinal study into hay fever related tweets in the UK. In
<italic>Proceedings of the 6th international conference on digital health conference (pp. 85–89). ACM</italic>
.</mixed-citation>
</ref>
<ref id="CR7">
<mixed-citation publication-type="other">Denecke, K. (2014). Extracting medical concepts from medical social media with clinical nlp tools: a qualitative study. In
<italic>Proceedings of the fourth workshop on building and evaluation resources for health and biomedical text processing</italic>
.</mixed-citation>
</ref>
<ref id="CR8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Denecke</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Nejdl</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>How valuable is medical social media data? Content analysis of the medical web</article-title>
<source>Information Sciences</source>
<year>2009</year>
<volume>179</volume>
<issue>12</issue>
<fpage>1870</fpage>
<lpage>1880</lpage>
<pub-id pub-id-type="doi">10.1016/j.ins.2009.01.025</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<mixed-citation publication-type="other">Elkin, N. (2008). How America searches: health and wellness. Opinion Research Corporation: iCrossing pp. 1–17.</mixed-citation>
</ref>
<ref id="CR10">
<mixed-citation publication-type="other">Esuli, A., & Sebastiani, F. (2007). SENTIWORDNET: a high-coverage lexical resource for opinion mining. Technical Report 2007-TR-02 Istituto di Scienza e Tecnologie dell’Informazione Consiglio Nazionale delle Ricerche Pisa IT.</mixed-citation>
</ref>
<ref id="CR11">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Fox</surname>
<given-names>S</given-names>
</name>
</person-group>
<source>The social life of health information, Vol. 2011</source>
<year>2011</year>
<publisher-loc>Washington, DC</publisher-loc>
<publisher-name>Pew Internet & American Life Project</publisher-name>
</element-citation>
</ref>
<ref id="CR12">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Friedman</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Hripcsak</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Shagina</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>Representing information in patient reports using natural language processing and the extensible markup language</article-title>
<source>Journal of the American Medical Informatics Association</source>
<year>1999</year>
<volume>6</volume>
<issue>1</issue>
<fpage>76</fpage>
<lpage>87</lpage>
<pub-id pub-id-type="doi">10.1136/jamia.1999.0060076</pub-id>
<pub-id pub-id-type="pmid">9925230</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Friedman</surname>
<given-names>C</given-names>
</name>
<name>
<surname>Shagina</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Lussier</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hripcsak</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Automated encoding of clinical documents based on natural language processing</article-title>
<source>Journal of the American Medical Informatics Association</source>
<year>2004</year>
<volume>11</volume>
<issue>5</issue>
<fpage>392</fpage>
<lpage>402</lpage>
<pub-id pub-id-type="doi">10.1197/jamia.M1552</pub-id>
<pub-id pub-id-type="pmid">15187068</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<mixed-citation publication-type="other">Gimpel, K., Schneider, N., O’Connor, B., Das, D., Mills, D., Eisenstein, J., Heilman, M., Yogatama, D., Flanigan, J., Smith, N.A. (2011). Part-of-speech tagging for twitter: annotation, features, and experiments. In
<italic>Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: short papers (Vol. 2, pp. 42–47). Association for Computational Linguistics</italic>
.</mixed-citation>
</ref>
<ref id="CR15">
<mixed-citation publication-type="other">Goodwin, T.R., & Harabagiu, S.M. (2016). Medical question answering for clinical decision support. In
<italic>Proceedings of the 25th ACM international on conference on information and knowledge management (pp. 297–306). ACM</italic>
.</mixed-citation>
</ref>
<ref id="CR16">
<mixed-citation publication-type="other">Gurobi. (2015). The overall fastest and best supported solver available.
<ext-link ext-link-type="uri" xlink:href="http://www.gurobi.com/">http://www.gurobi.com/</ext-link>
.</mixed-citation>
</ref>
<ref id="CR17">
<mixed-citation publication-type="other">Heinze, D.T., Morsch, M.L., Holbrook, J. (2001). Mining free-text medical records. In
<italic>Proceedings of the AMIA symposium (p. 254). American Medical Informatics Association</italic>
.</mixed-citation>
</ref>
<ref id="CR18">
<mixed-citation publication-type="other">Homan, C.M., Lu, N., Tu, X., Lytle, M.C., Silenzio, V. (2014). Social structure and depression in trevorspace. In
<italic>Proceedings of the 17th ACM conference on computer supported cooperative work & social computing (pp. 615–625). ACM</italic>
.</mixed-citation>
</ref>
<ref id="CR19">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hripcsak</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Austin</surname>
<given-names>JH</given-names>
</name>
<name>
<surname>Alderson</surname>
<given-names>PO</given-names>
</name>
<name>
<surname>Friedman</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Use of natural language processing to translate clinical information from a database of 889,921 chest radiographic reports</article-title>
<source>Radiology</source>
<year>2002</year>
<volume>224</volume>
<issue>1</issue>
<fpage>157</fpage>
<lpage>163</lpage>
<pub-id pub-id-type="doi">10.1148/radiol.2241011118</pub-id>
<pub-id pub-id-type="pmid">12091676</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<mixed-citation publication-type="other">Imran, M., Castillo, C., Lucas, J., Meier, P., Vieweg, S. (2014). Aidr: Artificial intelligence for disaster response. In
<italic>Proceedings of the WWW companion</italic>
(pp. 159–162).</mixed-citation>
</ref>
<ref id="CR21">
<mixed-citation publication-type="other">Imran, M., Mitra, P., Castillo, C. (2016). Twitter as a lifeline: human-annotated twitter corpora for nlp of crisis-related messages. In
<italic>Proceedings of the tenth international conference on language resources and evaluation (LREC 2016). European language resources association (ELRA), Paris, France</italic>
.</mixed-citation>
</ref>
<ref id="CR22">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kinnane</surname>
<given-names>NA</given-names>
</name>
<name>
<surname>Milne</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>The role of the internet in supporting and informing carers of people with cancer: a literature review</article-title>
<source>Supportive Care in Cancer</source>
<year>2010</year>
<volume>18</volume>
<issue>9</issue>
<fpage>1123</fpage>
<lpage>1136</lpage>
<pub-id pub-id-type="doi">10.1007/s00520-010-0863-4</pub-id>
<pub-id pub-id-type="pmid">20336326</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<mixed-citation publication-type="other">Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C., Smith, N.A. (2014). A dependency parser for tweets. In
<italic>Proceedings of the EMNLP</italic>
.</mixed-citation>
</ref>
<ref id="CR24">
<mixed-citation publication-type="other">Lin, C.Y. (2004). ROUGE: A package for automatic evaluation of summaries. In
<italic>Proceedings of the workshop on text summarization branches out (with ACL)</italic>
.</mixed-citation>
</ref>
<ref id="CR25">
<mixed-citation publication-type="other">Lu, Y., Zhang, P., Deng, S. (2013). Exploring health-related topics in online health community using cluster analysis. In
<italic>46th Hawaii international conference on system sciences (HICSS), 2013 (pp. 802–811). IEEE</italic>
.</mixed-citation>
</ref>
<ref id="CR26">
<mixed-citation publication-type="other">Maity, S., Chaudhary, A., Kumar, S., Mukherjee, A., Sarda, C., Patil, A., Mondal, A. (2016). Wassup? lol: characterizing out-of-vocabulary words in twitter. In
<italic>Proceedings of the 19th ACM conference on computer supported cooperative work and social computing companion, CSCW ’16 companion</italic>
(pp. 341–344). New York: ACM.</mixed-citation>
</ref>
<ref id="CR27">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Park</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Hartzler</surname>
<given-names>AL</given-names>
</name>
<name>
<surname>Huh</surname>
<given-names>J</given-names>
</name>
<name>
<surname>McDonald</surname>
<given-names>DW</given-names>
</name>
<name>
<surname>Pratt</surname>
<given-names>W</given-names>
</name>
</person-group>
<article-title>Automatically detecting failures in natural language processing tools for online community text</article-title>
<source>Journal of Medical Internet Research</source>
<year>2014</year>
<volume>17</volume>
<issue>8</issue>
<fpage>e212</fpage>
<lpage>e212</lpage>
<pub-id pub-id-type="doi">10.2196/jmir.4612</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Paul</surname>
<given-names>MJ</given-names>
</name>
<name>
<surname>Dredze</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>You are what you tweet: analyzing twitter for public health</article-title>
<source>Icwsm</source>
<year>2011</year>
<volume>20</volume>
<fpage>265</fpage>
<lpage>272</lpage>
</element-citation>
</ref>
<ref id="CR29">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pedregosa</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Varoquaux</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Gramfort</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Michel</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Thirion</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Grisel</surname>
<given-names>O</given-names>
</name>
<name>
<surname>Blondel</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Prettenhofer</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>R</given-names>
</name>
<name>
<surname>Dubourg</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Vanderplas</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Passos</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Cournapeau</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Brucher</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Perrot</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Duchesnay</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>Scikit-learn: machine learning in Python</article-title>
<source>Journal of Machine Learning Research</source>
<year>2011</year>
<volume>12</volume>
<fpage>2825</fpage>
<lpage>2830</lpage>
</element-citation>
</ref>
<ref id="CR30">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roberts</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Harabagiu</surname>
<given-names>SM</given-names>
</name>
</person-group>
<article-title>A flexible framework for deriving assertions from electronic medical records</article-title>
<source>Journal of the American Medical Informatics Association</source>
<year>2011</year>
<volume>18</volume>
<issue>5</issue>
<fpage>568</fpage>
<lpage>573</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2011-000152</pub-id>
<pub-id pub-id-type="pmid">21724741</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<mixed-citation publication-type="other">Rudra, K., Ghosh, S., Ganguly, N., Goyal, P., Ghosh, S. (2015). Extracting situational information from microblogs during disaster events: a classification-summarization approach. In
<italic>Proceedings of the CIKM</italic>
.</mixed-citation>
</ref>
<ref id="CR32">
<mixed-citation publication-type="other">Rudra, K., Sharma, A., Ganguly, N., Imran, M. (2017). Classifying information from microblogs during epidemics. In
<italic>Proceedings of the 2017 international conference on digital health (pp. 104–108). ACM</italic>
.</mixed-citation>
</ref>
<ref id="CR33">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Savova</surname>
<given-names>GK</given-names>
</name>
<name>
<surname>Masanz</surname>
<given-names>JJ</given-names>
</name>
<name>
<surname>Ogren</surname>
<given-names>PV</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Sohn</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kipper-Schuler</surname>
<given-names>KC</given-names>
</name>
<name>
<surname>Chute</surname>
<given-names>CG</given-names>
</name>
</person-group>
<article-title>Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications</article-title>
<source>Journal of the American Medical Informatics Association</source>
<year>2010</year>
<volume>17</volume>
<issue>5</issue>
<fpage>507</fpage>
<lpage>513</lpage>
<pub-id pub-id-type="doi">10.1136/jamia.2009.001560</pub-id>
<pub-id pub-id-type="pmid">20819853</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Scanfeld</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Scanfeld</surname>
<given-names>V</given-names>
</name>
<name>
<surname>Larson</surname>
<given-names>EL</given-names>
</name>
</person-group>
<article-title>Dissemination of health information through social networks: Twitter and antibiotics</article-title>
<source>American Journal of Infection Control</source>
<year>2010</year>
<volume>38</volume>
<issue>3</issue>
<fpage>182</fpage>
<lpage>188</lpage>
<pub-id pub-id-type="doi">10.1016/j.ajic.2009.11.004</pub-id>
<pub-id pub-id-type="pmid">20347636</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<mixed-citation publication-type="other">Stearns, M.Q., Price, C., Spackman, K.A., Wang, A.Y. (2001). SNOMED clinical terms: overview of the development process and project status. In
<italic>Proceedings of the AMIA symposium (p. 662). American Medical Informatics Association</italic>
.</mixed-citation>
</ref>
<ref id="CR36">
<mixed-citation publication-type="other">Tu, H., Ma, Z., Sun, A., Wang, X. (2016). When MetaMap meets social media in healthcare: are the word labels correct? In
<italic>Information retrieval technology (pp. 356–362). Springer</italic>
.</mixed-citation>
</ref>
<ref id="CR37">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Uzuner</surname>
<given-names>Ö</given-names>
</name>
<name>
<surname>South</surname>
<given-names>BR</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>S</given-names>
</name>
<name>
<surname>DuVall</surname>
<given-names>SL</given-names>
</name>
</person-group>
<article-title>2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text</article-title>
<source>Journal of the American Medical Informatics Association</source>
<year>2011</year>
<volume>18</volume>
<issue>5</issue>
<fpage>552</fpage>
<lpage>556</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2011-000203</pub-id>
<pub-id pub-id-type="pmid">21685143</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<mixed-citation publication-type="other">World Health Organization (WHO). (2014).
<ext-link ext-link-type="uri" xlink:href="http://www.who.int/mediacentre/">http://www.who.int/mediacentre/</ext-link>
.</mixed-citation>
</ref>
<ref id="CR39">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>FC</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>AJ</given-names>
</name>
<name>
<surname>Kuo</surname>
<given-names>SC</given-names>
</name>
</person-group>
<article-title>Mining health social media with sentiment analysis</article-title>
<source>Journal of Medical Systems</source>
<year>2016</year>
<volume>40</volume>
<issue>11</issue>
<fpage>236</fpage>
<pub-id pub-id-type="doi">10.1007/s10916-016-0604-4</pub-id>
<pub-id pub-id-type="pmid">27663246</pub-id>
</element-citation>
</ref>
<ref id="CR40">
<mixed-citation publication-type="other">Yom-Tov, E. (2015). Ebola data from the internet: an opportunity for syndromic surveillance or a news event? In
<italic>Proceedings of the 5th international conference on digital health 2015 (pp. 115–119). ACM</italic>
.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Sante/explor/CovidV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000268 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000268 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Wicri/Sante
   |area=    CovidV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:7087635
   |texte=   Classifying and Summarizing Information from Microblogs During Epidemics
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:NONE" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a CovidV2 


This area was generated with Dilib version V0.6.33.
Data generation: Sat Mar 28 17:51:24 2020. Site generation: Sun Jan 31 15:35:48 2021