Exploration server on the edentulous patient

Warning: this site is under development!
Warning: this site was generated by automated means from raw corpora.
The information has therefore not been validated.

TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records

Internal identifier: 000D66 (Pmc/Corpus); previous: 000D65; next: 000D67

Authors: Frank Po-Yen Lin; Adrian Pokorny; Christina Teng; Richard J. Epstein

RBID : PMC:5537364

Abstract

Vast amounts of clinically relevant text-based variables lie undiscovered and unexploited in electronic medical records (EMR). To exploit this untapped resource, and thus facilitate the discovery of informative covariates from unstructured clinical narratives, we have built a novel computational pipeline termed Text-based Exploratory Pattern Analyser for Prognosticator and Associator discovery (TEPAPA). This pipeline combines semantic-free natural language processing (NLP), regular expression induction, and statistical association testing to identify conserved text patterns associated with outcome variables of clinical interest. When we applied TEPAPA to a cohort of head and neck squamous cell carcinoma patients, plausible concepts known to be correlated with human papilloma virus (HPV) status were identified from the EMR text, including site of primary disease, tumour stage, pathologic characteristics, and treatment modalities. Similarly, correlates of other variables (including gender, nodal status, recurrent disease, smoking and alcohol status) were also reliably recovered. Using highly-associated patterns as covariates, a patient’s HPV status was classifiable using a bootstrap analysis with a mean area under the ROC curve of 0.861, suggesting its predictive utility in supporting EMR-based phenotyping tasks. These data support using this integrative approach to efficiently identify disease-associated factors from unstructured EMR narratives, and thus to efficiently generate testable hypotheses.


DOI: 10.1038/s41598-017-07111-0
PubMed: 28761061
PubMed Central: 5537364

Links to Exploration step

PMC:5537364

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records</title>
<author>
<name sortKey="Lin, Frank Po Yen" sort="Lin, Frank Po Yen" uniqKey="Lin F" first="Frank Po-Yen" last="Lin">Frank Po-Yen Lin</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9119 2677</institution-id>
<institution-id institution-id-type="GRID">grid.437825.f</institution-id>
<institution>Department of Oncology,</institution>
<institution>St Vincent’s Hospital &amp; The Kinghorn Cancer Centre,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9983 6924</institution-id>
<institution-id institution-id-type="GRID">grid.415306.5</institution-id>
<institution>Garvan Institute of Medical Research,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pokorny, Adrian" sort="Pokorny, Adrian" uniqKey="Pokorny A" first="Adrian" last="Pokorny">Adrian Pokorny</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9119 2677</institution-id>
<institution-id institution-id-type="GRID">grid.437825.f</institution-id>
<institution>Department of Oncology,</institution>
<institution>St Vincent’s Hospital &amp; The Kinghorn Cancer Centre,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Teng, Christina" sort="Teng, Christina" uniqKey="Teng C" first="Christina" last="Teng">Christina Teng</name>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0527 9653</institution-id>
<institution-id institution-id-type="GRID">grid.415994.4</institution-id>
<institution>Department of Medical Oncology,</institution>
<institution>Liverpool Hospital,</institution>
</institution-wrap>
Liverpool, Sydney, NSW Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Epstein, Richard J" sort="Epstein, Richard J" uniqKey="Epstein R" first="Richard J." last="Epstein">Richard J. Epstein</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9119 2677</institution-id>
<institution-id institution-id-type="GRID">grid.437825.f</institution-id>
<institution>Department of Oncology,</institution>
<institution>St Vincent’s Hospital &amp; The Kinghorn Cancer Centre,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9983 6924</institution-id>
<institution-id institution-id-type="GRID">grid.415306.5</institution-id>
<institution>Garvan Institute of Medical Research,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">28761061</idno>
<idno type="pmc">5537364</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5537364</idno>
<idno type="RBID">PMC:5537364</idno>
<idno type="doi">10.1038/s41598-017-07111-0</idno>
<date when="2017">2017</date>
<idno type="wicri:Area/Pmc/Corpus">000D66</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000D66</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records</title>
<author>
<name sortKey="Lin, Frank Po Yen" sort="Lin, Frank Po Yen" uniqKey="Lin F" first="Frank Po-Yen" last="Lin">Frank Po-Yen Lin</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9119 2677</institution-id>
<institution-id institution-id-type="GRID">grid.437825.f</institution-id>
<institution>Department of Oncology,</institution>
<institution>St Vincent’s Hospital &amp; The Kinghorn Cancer Centre,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9983 6924</institution-id>
<institution-id institution-id-type="GRID">grid.415306.5</institution-id>
<institution>Garvan Institute of Medical Research,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Pokorny, Adrian" sort="Pokorny, Adrian" uniqKey="Pokorny A" first="Adrian" last="Pokorny">Adrian Pokorny</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9119 2677</institution-id>
<institution-id institution-id-type="GRID">grid.437825.f</institution-id>
<institution>Department of Oncology,</institution>
<institution>St Vincent’s Hospital &amp; The Kinghorn Cancer Centre,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Teng, Christina" sort="Teng, Christina" uniqKey="Teng C" first="Christina" last="Teng">Christina Teng</name>
<affiliation>
<nlm:aff id="Aff3">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0527 9653</institution-id>
<institution-id institution-id-type="GRID">grid.415994.4</institution-id>
<institution>Department of Medical Oncology,</institution>
<institution>Liverpool Hospital,</institution>
</institution-wrap>
Liverpool, Sydney, NSW Australia</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Epstein, Richard J" sort="Epstein, Richard J" uniqKey="Epstein R" first="Richard J." last="Epstein">Richard J. Epstein</name>
<affiliation>
<nlm:aff id="Aff1">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9119 2677</institution-id>
<institution-id institution-id-type="GRID">grid.437825.f</institution-id>
<institution>Department of Oncology,</institution>
<institution>St Vincent’s Hospital &amp; The Kinghorn Cancer Centre,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff2">
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9983 6924</institution-id>
<institution-id institution-id-type="GRID">grid.415306.5</institution-id>
<institution>Garvan Institute of Medical Research,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Scientific Reports</title>
<idno type="eISSN">2045-2322</idno>
<imprint>
<date when="2017">2017</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p id="Par1">Vast amounts of clinically relevant text-based variables lie undiscovered and unexploited in electronic medical records (EMR). To exploit this untapped resource, and thus facilitate the discovery of informative covariates from unstructured clinical narratives, we have built a novel computational pipeline termed
<italic>T</italic>
ext-based
<italic>E</italic>
xploratory
<italic>P</italic>
attern
<italic>A</italic>
nalyser for
<italic>P</italic>
rognosticator and
<italic>A</italic>
ssociator discovery (TEPAPA). This pipeline combines semantic-free natural language processing (NLP), regular expression induction, and statistical association testing to identify conserved text patterns associated with outcome variables of clinical interest. When we applied TEPAPA to a cohort of head and neck squamous cell carcinoma patients, plausible concepts known to be correlated with human papilloma virus (HPV) status were identified from the EMR text, including site of primary disease, tumour stage, pathologic characteristics, and treatment modalities. Similarly, correlates of other variables (including gender, nodal status, recurrent disease, smoking and alcohol status) were also reliably recovered. Using highly-associated patterns as covariates, a patient’s HPV status was classifiable using a bootstrap analysis with a mean area under the ROC curve of 0.861, suggesting its predictive utility in supporting EMR-based phenotyping tasks. These data support using this integrative approach to efficiently identify disease-associated factors from unstructured EMR narratives, and thus to efficiently generate testable hypotheses.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Frankovich, J" uniqKey="Frankovich J">J Frankovich</name>
</author>
<author>
<name sortKey="Longhurst, Ca" uniqKey="Longhurst C">CA Longhurst</name>
</author>
<author>
<name sortKey="Sutherland, Sm" uniqKey="Sutherland S">SM Sutherland</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zheng, K" uniqKey="Zheng K">K Zheng</name>
</author>
<author>
<name sortKey="Mei, Q" uniqKey="Mei Q">Q Mei</name>
</author>
<author>
<name sortKey="Hanauer, Da" uniqKey="Hanauer D">DA Hanauer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kahn, Mg" uniqKey="Kahn M">MG Kahn</name>
</author>
<author>
<name sortKey="Weng, C" uniqKey="Weng C">C Weng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chute, Cg" uniqKey="Chute C">CG Chute</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Sledge, Gw" uniqKey="Sledge G">GW Sledge</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Abernethy, Ap" uniqKey="Abernethy A">AP Abernethy</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shrager, J" uniqKey="Shrager J">J Shrager</name>
</author>
<author>
<name sortKey="Tenenbaum, Jm" uniqKey="Tenenbaum J">JM Tenenbaum</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jensen, Pb" uniqKey="Jensen P">PB Jensen</name>
</author>
<author>
<name sortKey="Jensen, Lj" uniqKey="Jensen L">LJ Jensen</name>
</author>
<author>
<name sortKey="Brunak, S" uniqKey="Brunak S">S Brunak</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kho, An" uniqKey="Kho A">AN Kho</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Warner, Jl" uniqKey="Warner J">JL Warner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Denny, Jc" uniqKey="Denny J">JC Denny</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ritchie, Md" uniqKey="Ritchie M">MD Ritchie</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Denny, Jc" uniqKey="Denny J">JC Denny</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, Wq" uniqKey="Wei W">WQ Wei</name>
</author>
<author>
<name sortKey="Denny, Jc" uniqKey="Denny J">JC Denny</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kohane, Is" uniqKey="Kohane I">IS Kohane</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Denny, Jc" uniqKey="Denny J">JC Denny</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Uzuner, O" uniqKey="Uzuner O">O Uzuner</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Delisle, S" uniqKey="Delisle S">S DeLisle</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Roque, Fs" uniqKey="Roque F">FS Roque</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kullo, Ij" uniqKey="Kullo I">IJ Kullo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Fernandez Breis, Jt" uniqKey="Fernandez Breis J">JT Fernández-Breis</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Richesson, Rl" uniqKey="Richesson R">RL Richesson</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chaturvedi, Ak" uniqKey="Chaturvedi A">AK Chaturvedi</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Smith, Em" uniqKey="Smith E">EM Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gillison, Ml" uniqKey="Gillison M">ML Gillison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marur, S" uniqKey="Marur S">S Marur</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Anaya Saavedra, G" uniqKey="Anaya Saavedra G">G Anaya-Saavedra</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Klussmann, Jp" uniqKey="Klussmann J">JP Klussmann</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="D Ouza, G" uniqKey="D Ouza G">G D’Souza</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Begum, S" uniqKey="Begum S">S Begum</name>
</author>
<author>
<name sortKey="Westra, Wh" uniqKey="Westra W">WH Westra</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mork, J" uniqKey="Mork J">J Mork</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gillison, Ml" uniqKey="Gillison M">ML Gillison</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hafkamp, Hc" uniqKey="Hafkamp H">HC Hafkamp</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Goldenberg, D" uniqKey="Goldenberg D">D Goldenberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="O Ullivan, B" uniqKey="O Ullivan B">B O’Sullivan</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Toutanova, K" uniqKey="Toutanova K">K Toutanova</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Klein, D" uniqKey="Klein D">D Klein</name>
</author>
<author>
<name sortKey="Manning, Cd" uniqKey="Manning C">CD Manning</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bui, Dd" uniqKey="Bui D">DD Bui</name>
</author>
<author>
<name sortKey="Zeng Treitler, Q" uniqKey="Zeng Treitler Q">Q Zeng-Treitler</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hall, M" uniqKey="Hall M">M Hall</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freund, Y" uniqKey="Freund Y">Y Freund</name>
</author>
<author>
<name sortKey="Mason, L" uniqKey="Mason L">L Mason</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Prasse, P" uniqKey="Prasse P">P Prasse</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Savova, Gk" uniqKey="Savova G">GK Savova</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bland, Jm" uniqKey="Bland J">JM Bland</name>
</author>
<author>
<name sortKey="Altman, Dg" uniqKey="Altman D">DG Altman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Benjamini, Y" uniqKey="Benjamini Y">Y Benjamini</name>
</author>
<author>
<name sortKey="Hochberg, Y" uniqKey="Hochberg Y">Y Hochberg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hripcsak, G" uniqKey="Hripcsak G">G Hripcsak</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hripcsak, G" uniqKey="Hripcsak G">G Hripcsak</name>
</author>
<author>
<name sortKey="Albers, Dj" uniqKey="Albers D">DJ Albers</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hersh, Wr" uniqKey="Hersh W">WR Hersh</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sci Rep</journal-id>
<journal-id journal-id-type="iso-abbrev">Sci Rep</journal-id>
<journal-title-group>
<journal-title>Scientific Reports</journal-title>
</journal-title-group>
<issn pub-type="epub">2045-2322</issn>
<publisher>
<publisher-name>Nature Publishing Group UK</publisher-name>
<publisher-loc>London</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">28761061</article-id>
<article-id pub-id-type="pmc">5537364</article-id>
<article-id pub-id-type="publisher-id">7111</article-id>
<article-id pub-id-type="doi">10.1038/s41598-017-07111-0</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">http://orcid.org/0000-0001-9250-6874</contrib-id>
<name>
<surname>Lin</surname>
<given-names>Frank Po-Yen</given-names>
</name>
<address>
<email>f.lin@garvan.org.au</email>
</address>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Pokorny</surname>
<given-names>Adrian</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Teng</surname>
<given-names>Christina</given-names>
</name>
<xref ref-type="aff" rid="Aff3">3</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Epstein</surname>
<given-names>Richard J.</given-names>
</name>
<xref ref-type="aff" rid="Aff1">1</xref>
<xref ref-type="aff" rid="Aff2">2</xref>
</contrib>
<aff id="Aff1">
<label>1</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9119 2677</institution-id>
<institution-id institution-id-type="GRID">grid.437825.f</institution-id>
<institution>Department of Oncology,</institution>
<institution>St Vincent’s Hospital &amp; The Kinghorn Cancer Centre,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</aff>
<aff id="Aff2">
<label>2</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0000 9983 6924</institution-id>
<institution-id institution-id-type="GRID">grid.415306.5</institution-id>
<institution>Garvan Institute of Medical Research,</institution>
</institution-wrap>
Darlinghurst, NSW Australia</aff>
<aff id="Aff3">
<label>3</label>
<institution-wrap>
<institution-id institution-id-type="ISNI">0000 0004 0527 9653</institution-id>
<institution-id institution-id-type="GRID">grid.415994.4</institution-id>
<institution>Department of Medical Oncology,</institution>
<institution>Liverpool Hospital,</institution>
</institution-wrap>
Liverpool, Sydney, NSW Australia</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>31</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>31</day>
<month>7</month>
<year>2017</year>
</pub-date>
<pub-date pub-type="collection">
<year>2017</year>
</pub-date>
<volume>7</volume>
<elocation-id>6918</elocation-id>
<history>
<date date-type="received">
<day>3</day>
<month>11</month>
<year>2016</year>
</date>
<date date-type="accepted">
<day>21</day>
<month>6</month>
<year>2017</year>
</date>
</history>
<permissions>
<copyright-statement>© The Author(s) 2017</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit
<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<p id="Par1">Vast amounts of clinically relevant text-based variables lie undiscovered and unexploited in electronic medical records (EMR). To exploit this untapped resource, and thus facilitate the discovery of informative covariates from unstructured clinical narratives, we have built a novel computational pipeline termed
<italic>T</italic>
ext-based
<italic>E</italic>
xploratory
<italic>P</italic>
attern
<italic>A</italic>
nalyser for
<italic>P</italic>
rognosticator and
<italic>A</italic>
ssociator discovery (TEPAPA). This pipeline combines semantic-free natural language processing (NLP), regular expression induction, and statistical association testing to identify conserved text patterns associated with outcome variables of clinical interest. When we applied TEPAPA to a cohort of head and neck squamous cell carcinoma patients, plausible concepts known to be correlated with human papilloma virus (HPV) status were identified from the EMR text, including site of primary disease, tumour stage, pathologic characteristics, and treatment modalities. Similarly, correlates of other variables (including gender, nodal status, recurrent disease, smoking and alcohol status) were also reliably recovered. Using highly-associated patterns as covariates, a patient’s HPV status was classifiable using a bootstrap analysis with a mean area under the ROC curve of 0.861, suggesting its predictive utility in supporting EMR-based phenotyping tasks. These data support using this integrative approach to efficiently identify disease-associated factors from unstructured EMR narratives, and thus to efficiently generate testable hypotheses.</p>
</abstract>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2017</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1" sec-type="introduction">
<title>Introduction</title>
<p id="Par2">The widespread digitisation of clinical data through the adoption of electronic medical records (EMR) has prompted many proposed secondary uses across clinical and research applications
<sup>
<xref ref-type="bibr" rid="CR1">1</xref>
–
<xref ref-type="bibr" rid="CR4">4</xref>
</sup>
. In particular, as data sharing frameworks have been developed, healthcare data analytics has emerged as a new field of translational science
<sup>
<xref ref-type="bibr" rid="CR3">3</xref>
</sup>
. As an illustrative example in oncology, the CancerLinQ framework of the American Society of Clinical Oncology provides a “rapid learning health system” that connects isolated EMR systems across institutions to expedite collaborative patient management
<sup>
<xref ref-type="bibr" rid="CR5">5</xref>
,
<xref ref-type="bibr" rid="CR6">6</xref>
</sup>
. Developing pragmatic, automated methods to leverage this huge resource could soon have an impact on translational cancer research
<sup>
<xref ref-type="bibr" rid="CR7">7</xref>
,
<xref ref-type="bibr" rid="CR8">8</xref>
</sup>
. Moreover, from a precision medicine perspective, finding accurate associative and prognostic factors should empower clinicians to tailor effective treatments.</p>
<p id="Par3">Many EMR-based secondary analyses have correlated outcomes data to structured variables (e.g., laboratory and medication) or administrative coding (e.g., billing) to unearth knowledge that would otherwise remain occult
<sup>
<xref ref-type="bibr" rid="CR9">9</xref>
–
<xref ref-type="bibr" rid="CR13">13</xref>
</sup>
. These abridged data, however, represent only a proverbial tip of the clinical iceberg. For example, EMR narratives generate great informatic potency via the rich combination of subjective patient encounters with objective and/or measurable clinical events
<sup>
<xref ref-type="bibr" rid="CR14">14</xref>
–
<xref ref-type="bibr" rid="CR16">16</xref>
</sup>
. Methods of simple text search and natural language processing (NLP) have been applied to infer patient characteristics (i.e., EMR-based case detection and phenotyping methods) from clinical narratives to discover new and possibly causal associations
<sup>
<xref ref-type="bibr" rid="CR13">13</xref>
,
<xref ref-type="bibr" rid="CR17">17</xref>
–
<xref ref-type="bibr" rid="CR22">22</xref>
</sup>
. However, although these high-throughput analyses may be powerful in quantifying the degree of association, an important limitation is that the covariates yet to be recognised by domain experts cannot be reliably assessed.</p>
<p id="Par4">Hence, to systematically identify unrecognised covariates at an early phase of discovery, we hypothesise a need to mine EMR matrix features in a “deep-data” manner to complement population-based “big-data” inquiries. To this end we present here an unbiased feature-learning pipeline,
<italic>T</italic>
ext-based
<italic>E</italic>
xploratory
<italic>P</italic>
attern
<italic>A</italic>
nalyser for
<italic>P</italic>
rognosticator and
<italic>A</italic>
ssociator discovery (TEPAPA), which combines semantic-free NLP methods, pattern search, and a “pattern-wide association study” (hereafter PatWAS) to capture conserved patterns of EMR text associated with clinical outcomes of interest. With translational utility in mind, TEPAPA is designed to deliver “white-box” interpretable results to researchers for rapid hypothesis generation, thereby providing an open-source framework that drives integration of external NLP and machine learning methods.</p>
<p id="Par5">To determine how TEPAPA performs in a real-life discovery task, we conduct here a single-centre validation study to determine whether clinicopathologic factors associated with human papilloma virus (HPV)-related head and neck squamous cell carcinoma (HNSCC) can be discovered from routine clinical EMR data. The epidemic increase in HPV-related cases over the last two decades reflects changes in sexual practice among younger adults
<sup>
<xref ref-type="bibr" rid="CR23">23</xref>
</sup>
; since the clinicopathologic characteristics associated with this cancer have been thoroughly studied
<sup>
<xref ref-type="bibr" rid="CR24">24</xref>
–
<xref ref-type="bibr" rid="CR36">36</xref>
</sup>
, these data sets are attractive for rediscovery evaluation. Beyond this knowledge discovery task, we also examine whether the highly-correlated text features extracted by TEPAPA can be used to classify a patient’s HPV status in combination with supervised machine learning and, if so, whether this demonstrates the practical utility of the pipeline for supporting EMR-based phenotyping applications.</p>
</sec>
<sec id="Sec2" sec-type="materials|methods">
<title>Methods</title>
<sec id="Sec3">
<title>The in silico discovery pipeline</title>
<sec id="Sec4">
<title>Case identification, EMR retrieval, and data cleaning</title>
<p id="Par6">The discovery process begins with identification of representative cases and controls providing sufficient data quantity and quality to frame a clinical question of interest. Each case is labelled with an outcome variable of interest (either binary or numeric) for correlative analyses. The corresponding EMR text narratives, including clinical correspondence, consultation notes, radiology and pathology reports, are extracted. Sentence chunking is then performed, followed by zero or more annotation methods (see below) prior to transformation into sequences of word-based tokens delimited by white spaces and punctuation marks. The flowchart of analysis is shown in Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
.
<fig id="Fig1">
<label>Figure 1</label>
<caption>
<p>The TEPAPA discovery pipeline. Abbreviations: EMR: electronic medical record.</p>
</caption>
<graphic xlink:href="41598_2017_7111_Fig1_HTML" id="d29e366"></graphic>
</fig>
</p>
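<p>As a rough illustration of the tokenisation step described above, the following Python sketch chunks a narrative into sentences and then into word-based tokens delimited by white space and punctuation. It is a minimal sketch, not the TEPAPA implementation: the function names split_sentences, tokenize and narrative_to_token_sequences, the naive sentence-boundary rule, and the choice to discard punctuation are all assumptions made for illustration.</p>
<preformat>
```python
import re


def split_sentences(text):
    """Naive sentence chunking: cut after runs of '.', '!' or '?'.

    Assumption for illustration only: decimals such as "0.5" would be
    split incorrectly, so a real chunker needs more care.
    """
    chunks = re.findall(r"[^.!?]+[.!?]*", text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


def tokenize(sentence):
    """Return word-based tokens; white space and punctuation act as delimiters."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def narrative_to_token_sequences(text):
    """Turn one EMR narrative into a list of token sequences, one per sentence."""
    return [tokenize(sentence) for sentence in split_sentences(text)]
```
</preformat>
<p>Each resulting token sequence could then be passed to the optional annotation methods before any pattern search; keeping one sequence per sentence is a design choice of this sketch, not a detail stated in the text.</p>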
</sec>
<sec id="Sec5">
<title>Text annotation</title>
<p id="Par7">Two classes of optional pre-processing methods were used to annotate the EMR text (Fig. 
<xref rid="Fig2" ref-type="fig">2A</xref>
):
<list list-type="order">
<list-item>
<p id="Par8">A
<italic>token-level annotation</italic>
method that assigns tags to a token in order to reflect its properties. Annotations of this class include labelling of cardinal numbers, word stemming (STEM)
<sup>
<xref ref-type="bibr" rid="CR37">37</xref>
</sup>
, part-of-speech tagging (POSTAG)
<sup>
<xref ref-type="bibr" rid="CR38">38</xref>
</sup>
and/or lemmatisation. The overall goal here is to improve sensitivity (i.e., recall) of a pattern.</p>
</list-item>
<list-item>
<p id="Par9">A
<italic>sequence-level annotation</italic>
method that improves specificity through reduction of spurious discoveries by grouping consecutive token descriptors of a given concept into a new token. For example, “
<italic>head of pancreas</italic>
” is treated as a unigram instead of separate words “
<italic>head</italic>
”, “
<italic>of</italic>
”, and “
<italic>pancreas</italic>
” - which have different meanings. Two annotation methods of this category were examined:
<list list-type="alpha-lower">
<list-item>
<p id="Par10">
<italic>Syntactic parsing</italic>
(SPARSE), which transforms a sentence into the Penn Treebank format using the Stanford CoreNLP Parser
<sup>
<xref ref-type="bibr" rid="CR39">39</xref>
</sup>
and new tokens are generated by traversing through each node of the tree structure;</p>
</list-item>
<list-item>
<p id="Par11">
<italic>Vocabulary-based concept recognition</italic>
, which maps recognised text fragments to a new unigram based on the Unified Medical Language System (UMLS) vocabulary (Metathesaurus, version 2016AA) using longest-string matching
<sup>
<xref ref-type="bibr" rid="CR40">40</xref>
,
<xref ref-type="bibr" rid="CR41">41</xref>
</sup>
.</p>
</list-item>
</list>
</p>
</list-item>
</list>
<fig id="Fig2">
<label>Figure 2</label>
<caption>
<p>Illustrated methods of annotation, sub-sequence search, and regular expression induction. EMR narratives are tokenized, annotated, and transformed into text fragments (n-grams) prior to association testing. Syntactically similar n-grams are then (optionally) grouped into regular expressions with the aim of aggregating conceptually similar features to improve overall recall.</p>
</caption>
<graphic xlink:href="41598_2017_7111_Fig2_HTML" id="d29e456"></graphic>
</fig>
</p>
</sec>
<sec id="Sec6">
<title>Feature generation through exhaustive sequence search</title>
<p id="Par12">The most basic feature for discovery is defined as a string of word-based tokens (
<italic>n</italic>
-gram). Unique
<italic>n</italic>
-grams are identified through a corpus-wide exhaustive search (Fig. 
<xref rid="Fig2" ref-type="fig">2B</xref>
) and all
<italic>n</italic>
-grams are used as
<italic>binary features</italic>
(i.e. either present or absent in a case) in the subsequent association analysis. The extent of search is delimited by sentence and document boundaries. If a token-based annotation method is used, a combinatorial search method is applied to generate all possible sub-sequences using all tokens and tags (Fig. 
<xref rid="Fig2" ref-type="fig">2B</xref>
); these patterns are then used in the subsequent association analysis.</p>
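<p>The exhaustive search for unannotated text can be sketched as below, under stated assumptions: the function name ngram_occurrence_profiles and the length cap max_n are hypothetical, and the combinatorial expansion over annotation tags is omitted. Each n-gram is mapped to the set of cases in which it occurs, which serves as its binary occurrence profile.</p>

```python
from collections import defaultdict


def ngram_occurrence_profiles(cases, max_n=3):
    """Map each n-gram (tuple of tokens) to the set of case ids containing it.

    `cases` maps a case id to a list of tokenised sentences, so n-grams never
    cross a sentence boundary. `max_n` is an illustrative cap, not a value
    taken from the paper.
    """
    profiles = defaultdict(set)
    for case_id, sentences in cases.items():
        for tokens in sentences:
            for start in range(len(tokens)):
                stop = min(len(tokens), start + max_n)
                for end in range(start + 1, stop + 1):
                    profiles[tuple(tokens[start:end])].add(case_id)
    return dict(profiles)
```

<p>Storing a set of case identifiers per n-gram makes a feature present or absent in a case regardless of how often it is repeated there.</p>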
<p id="Par13">The
<italic>numeric features</italic>
, which take the form of “
<italic>A</italic>
<italic>NUMBER</italic>
<italic>B</italic>
” (e.g. “
<italic>contains</italic>
<italic>NUMBER</italic>
<italic>metastatic nodes</italic>
”), are first identified by extracting all cardinal numbers from the text, followed by identification of a pair of flanking
<italic>n</italic>
-grams (
<italic>A</italic>
and
<italic>B</italic>
) using the same exhaustive search methods above. If a flanking pair occurs more than once in a case, the pattern is discarded to avoid ambiguity. The numeric value is then extracted for association analysis.</p>
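<p>A minimal sketch of numeric feature extraction for a single case follows. It assumes that both flanking n-grams are non-empty and at most max_flank tokens long; that cap, the NUMBER expression, and the function name numeric_features are illustrative choices, not details from the paper.</p>

```python
import re
from collections import Counter

# Hypothetical definition of a cardinal number token (integer or decimal).
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numeric_features(sentences, max_flank=2):
    """Return {(A, B): value} for patterns "A NUMBER B" found in one case.

    `sentences` is a list of token lists. A and B are tuples of up to
    `max_flank` tokens directly before and after a numeric token. A flanking
    pair seen more than once in the case is discarded as ambiguous.
    """
    seen = Counter()
    values = {}
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if not NUMBER.fullmatch(tok):
                continue
            right_room = len(tokens) - i - 1
            for a_len in range(1, min(max_flank, i) + 1):
                for b_len in range(1, min(max_flank, right_room) + 1):
                    left = tuple(tokens[i - a_len:i])
                    right = tuple(tokens[i + 1:i + 1 + b_len])
                    seen[(left, right)] += 1
                    values[(left, right)] = float(tok)
    return {key: val for key, val in values.items() if seen[key] == 1}
```

<p>In the test case used while drafting this sketch, the pair ("contains", "metastatic") occurs twice in one case and is therefore dropped, whereas the longer, unambiguous flanks are retained with their values.</p>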
</sec>
<sec id="Sec7">
<title>Statistical association analysis (“PatWAS”)</title>
<p id="Par14">Non-parametric univariate methods are applied to assess the statistical independence between a feature and the outcome variable of interest. For a binary feature, we first determine a vector indicating its occurrences across all cases (i.e. its occurrence profile), and then calculate the odds ratio (OR) and Fisher’s exact test for binary outcome variables, or the area under the receiver operating characteristic curve (AUC, Mann-Whitney-Wilcoxon test) for numeric outcome variables. For a numeric feature, the degree of association is determined by the AUC (for binary outcomes) or Spearman’s ρ (for numeric outcomes).</p>
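<p>For the binary-feature, binary-outcome case, the test can be sketched with the standard library alone. The table layout (a: feature present and outcome positive; b: present and negative; c: absent and positive; d: absent and negative) and the function names are assumptions of this sketch; the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one, which is one common convention.</p>

```python
from math import comb


def contingency(profile, positives, all_cases):
    """Build (a, b, c, d) from an occurrence profile and the positive cases."""
    a = len(profile.intersection(positives))
    b = len(profile.difference(positives))
    c = len(positives.difference(profile))
    d = len(all_cases) - a - b - c
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """Sample odds ratio; infinite if b*c is zero, undefined if a*d is zero too."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x cases in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    tail = sum(prob(x) for x in range(lowest, highest + 1)
               if p_obs * (1 + 1e-9) >= prob(x))
    return min(1.0, tail)
```

<p>The small tolerance on p_obs guards against floating-point ties between equally likely tables.</p>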
</sec>
<sec id="Sec8">
<title>Feature filtering and reduction</title>
<p id="Par15">Features are filtered by an
<italic>ad hoc</italic>
significance threshold assigned by the investigator, considering the data characteristics and multiple hypothesis testing. Highly-correlated patterns that do not improve interpretability of results are removed: a feature is removed if there exists a longer sequence sharing the same occurrence profile (e.g., “
<italic>extensive liver metastases</italic>
” has more explanatory power than “
<italic>liver metastases</italic>
” and “
<italic>metastases</italic>
”, if all three
<italic>n</italic>
-grams appear in the same occurrence profile).</p>
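<p>The redundancy rule can be sketched as follows, assuming that the "longer sequence" must contain the shorter n-gram as a contiguous run of tokens (the example above suggests this, but the text does not state it explicitly). The function names are hypothetical.</p>

```python
def is_contained(short, long):
    """True if `short` occurs as a contiguous run of tokens inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def drop_redundant(profiles):
    """Keep an n-gram unless a longer n-gram containing it has the same profile.

    `profiles` maps token tuples to sets of case ids.
    """
    by_profile = {}
    for gram, cases in profiles.items():
        by_profile.setdefault(frozenset(cases), []).append(gram)
    kept = {}
    for cases, grams in by_profile.items():
        for gram in grams:
            shadowed = any(len(other) > len(gram) and is_contained(gram, other)
                           for other in grams)
            if not shadowed:
                kept[gram] = set(cases)
    return kept
```

<p>Grouping by occurrence profile first means that containment is only checked among n-grams that are statistically indistinguishable.</p>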
</sec>
<sec id="Sec9">
<title>Post-processing of binary features by predictive regular expression induction</title>
<p id="Par16">Syntactically-similar but weakly predictive text fragments may be grouped together to form a stronger “meta-feature” to improve recall. As an example, “
<italic>extensive bone metastasis</italic>
” and “
<italic>extensive liver metastasis</italic>
” may be combined to form a regular expression “
<italic>extensive</italic>
(
<italic>bone|liver</italic>
)
<italic>metastasis</italic>
” to indicate a new composite concept. To generate such regular expressions, we first identify all
<italic>n</italic>
-grams sharing the same starting and ending tokens. The Needleman-Wunsch algorithm is then applied to perform global sequence alignment, followed by a consolidation algorithm to group sequences into a linear, non-recursive expression as depicted in Fig. 
<xref rid="Fig2" ref-type="fig">2C</xref>
. Previously, regular expressions have been shown to improve precision in information extraction from clinical text
<sup>
<xref ref-type="bibr" rid="CR42">42</xref>
</sup>
. In contrast to the local alignment approach
<sup>
<xref ref-type="bibr" rid="CR42">42</xref>
</sup>
, we used global alignment because a wildcard at either end of a regular expression would result in indiscriminate matching of tokens and a consequent loss of specificity. The degree of association of each induced regular expression is then reassessed by the PatWAS step above.</p>
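<p>A pairwise simplification of this step is sketched below. It aligns two token sequences with the Needleman-Wunsch algorithm and consolidates the alignment columns into a regular expression. The scoring values, the function names, and the restriction to two sequences are assumptions of this sketch; the consolidation algorithm used in TEPAPA groups many sequences and is not reproduced here. The sketch also assumes that the two sequences share their first and last tokens.</p>

```python
import re


def align(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists.

    Returns a list of (token_a, token_b) columns; None marks a gap.
    """
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    columns = []
    i, j = n, m
    while i > 0 or j > 0:  # trace back from the bottom-right cell
        if i > 0 and j > 0:
            step = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                columns.append((seq_a[i - 1], seq_b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((seq_a[i - 1], None))
            i -= 1
        else:
            columns.append((None, seq_b[j - 1]))
            j -= 1
    columns.reverse()
    return columns


def to_regex(columns):
    """Consolidate alignment columns into a linear, non-recursive pattern."""
    pieces = []
    last = len(columns) - 1
    for k, (a, b) in enumerate(columns):
        sep = "" if k == last else " "
        if a == b:
            pieces.append(re.escape(a) + sep)
        elif a is None or b is None:
            # A token present in only one sequence becomes optional.
            pieces.append("(?:" + re.escape(a or b) + sep + ")?")
        else:
            pieces.append("(?:" + re.escape(a) + "|" + re.escape(b) + ")" + sep)
    return "".join(pieces)
```

<p>Applied to the tokens of "extensive bone metastasis" and "extensive liver metastasis", the two functions yield a pattern with a single alternation in the middle position; the optional separator is placed inside the optional group so that the pattern still matches when that token is absent.</p>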
</sec>
<sec id="Sec10">
<title>Performance considerations</title>
<p id="Par17">Heuristics are applied to reduce the hypothesis space, as the “curse of dimensionality” is unavoidable in any high-dimensional analysis. Techniques used to improve pipeline efficiency include aggressive result caching, token indexing, and search termination if an elongating pattern occurs only once in the corpus. In particular, exhaustive traversal through all annotated subclasses (e.g. part of speech and concept hierarchy) would incur a theoretical time complexity of O(c
<sup>N</sup>
) (c > 1, i.e. exponential time), thus needing aggressive feature reduction: when a token-based annotation method is used, we first remove annotations that are uniquely associated with a token without an occurrence elsewhere in the corpus; up to 90% of annotations may be removed by this approach.</p>
</sec>
</sec>
<sec id="Sec11">
<title>The HNSCC validation cohort</title>
<sec id="Sec12">
<title>Study population</title>
<p id="Par18">Consecutive patients presenting to a tertiary referral hospital over a twelve-month period (February 2015–February 2016) were screened for inclusion. The cases were dichotomised into HPV-related and -unrelated groups by documented
<italic>in situ</italic>
hybridisation (ISH) results (either mentioned in correspondence or pathology report) or P16 (cyclin-dependent kinase inhibitor 2A protein, encoded by
<italic>CDKN2A</italic>
gene) immunohistochemistry (at least 2+), which was used as a surrogate marker if an ISH assay was not performed.</p>
</sec>
<sec id="Sec13">
<title>Data extraction</title>
<p id="Par19">The free-text component of the clinical documents associated with each case, including multidisciplinary team (MDT) meeting reports, clinic letters, and radiology and pathology reports, was extracted from the EMR to form the corpus. Patient identifiers, the names and roles of clinicians, and practice addresses were removed using string matching, followed by manual verification by the lead investigator. Three investigators (FL, AP, and CT) independently reviewed the HPV status of all cases. Blood-based assays were not included in this analysis.</p>
</sec>
<sec id="Sec14">
<title>Statistical and exploratory analyses</title>
<p id="Par20">Clinicopathologic variables were analysed by descriptive statistics using the R statistical environment, version 3.3. The patterns discovered by TEPAPA were reviewed qualitatively by the authors and compared with the published literature.</p>
</sec>
<sec id="Sec15">
<title>Predictive analysis</title>
<p id="Par21">We further examined whether the highly-associated text patterns can be used in conjunction with supervised learning to predict case labels. To assess how pipeline variations may affect the accuracy of prediction and computational time, we used a factorial design to vary methods of annotation (part-of-speech tagging, syntactic parsing, word stemming, UMLS-based token aggregation), post-processing (with or without regular expression induction), threshold selection (log
<sub>10</sub>
deviation from best threshold), in conjunction with different machine learning algorithms.</p>
<p id="Par22">Each pipeline was applied to identify text features associated with HPV status. To avoid selecting highly collinear features, we applied hierarchical clustering with the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm and Euclidean distance to cluster the features into N/10 groups (i.e. one-tenth of the sample size). The feature with the smallest p-value from each group was used for classification. Waikato Environment for Knowledge Analysis (WEKA) 3.6.6 was used for classifier training and evaluation
<sup>
<xref ref-type="bibr" rid="CR43">43</xref>
</sup>
. One generative classifier, Naive Bayes (NB), and two discriminative classifiers, logistic regression (LR) and the alternating decision tree model (ADTree)
<sup>
<xref ref-type="bibr" rid="CR44">44</xref>
</sup>
with ten boosting iterations were examined. The predictive accuracy was assessed by the AUC averaged over 25 bootstrap runs. The relative computational time was also analysed. Multiple linear regression was used for the statistical analysis.</p>
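<p>The collinearity-reduction step can be illustrated with a naive average-linkage (UPGMA) clustering written with the standard library. This is a sketch under stated assumptions, not the code used in the study: each feature is represented by its 0/1 occurrence vector across cases, the function names are hypothetical, and the cubic-time merge loop is only suitable for small feature sets.</p>

```python
from math import dist


def upgma_groups(vectors, n_groups):
    """Merge row vectors by average linkage until n_groups clusters remain.

    Returns a list of clusters, each a list of row indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def average_distance(c1, c2):
        # UPGMA: mean Euclidean distance over all cross-cluster pairs.
        total = sum(dist(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > max(1, n_groups):
        pairs = [(average_distance(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


def select_features(vectors_by_name, p_values, n_cases):
    """Pick the smallest-p feature from each of n_cases // 10 groups."""
    names = list(vectors_by_name)
    vectors = [vectors_by_name[name] for name in names]
    groups = upgma_groups(vectors, n_cases // 10)
    return [min((names[i] for i in group), key=lambda name: p_values[name])
            for group in groups]
```

<p>With 82 cases, as in the cohort described below, this rule would retain at most eight features.</p>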
</sec>
<sec id="Sec16">
<title>Ethics approval and informed consent</title>
<p id="Par23">This study was approved by St. Vincent’s Hospital Human Research Ethics Committee (HREC), Sydney, Australia. Data collection and analysis were conducted in accordance with the HREC regulations and the National Statement on Ethical Conduct in Human Research (2007), published by the Australian National Health and Medical Research Council (NHMRC). The need for informed consent was waived by the HREC for this retrospective study.</p>
</sec>
</sec>
<sec id="Sec17">
<title>Ethics</title>
<p id="Par24">This study was approved by the Human Research Ethics Committee (HREC) of St. Vincent’s Hospital, Sydney, Australia (Reference number: LNR/15/SVH/458).</p>
</sec>
</sec>
<sec id="Sec18" sec-type="results">
<title>Results</title>
<sec id="Sec19">
<title>Characteristics of the study cohort and EMR corpus</title>
<p id="Par25">One hundred and eighty-nine consecutive patients who attended the head and neck multidisciplinary team (MDT) cancer clinic at the study site from February 2015 to February 2016 were screened (Table 
<xref rid="Tab1" ref-type="table">1</xref>
). A total of 141 patients with documented squamous cell carcinoma were further inspected (Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
). Of 82 patients, 50 (61%) had documented HPV/P16-positive disease (i.e. HPV-related), recorded either in the pathology report or in other clinical correspondence (e.g., performed by external pathology services). Three cases were later found to contain no tumour in surgical or repeat biopsy specimens.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>The characteristics of HNSCC cohort by HPV/P16 status.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="3">Characteristic</th>
<th rowspan="3">Value</th>
<th colspan="4">HPV/P16 status</th>
<th rowspan="3">P
<sup>
<italic>a</italic>
</sup>
</th>
</tr>
<tr>
<th colspan="2">Positive (n = 50)</th>
<th colspan="2">Negative (n = 32)</th>
</tr>
<tr>
<th>N</th>
<th>(%)</th>
<th>N</th>
<th>(%)</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="7">
<italic>Demographics</italic>
</td>
</tr>
<tr>
<td> Age at diagnosis</td>
<td>Mean (years)</td>
<td>61.5</td>
<td>(95%CI: 58.9–64.2)</td>
<td>65.5</td>
<td>(95%CI: 60.9–70)</td>
<td>0.14</td>
</tr>
<tr>
<td rowspan="2"> Gender</td>
<td>Male</td>
<td>44</td>
<td>(88)</td>
<td>25</td>
<td>(78)</td>
<td>0.38</td>
</tr>
<tr>
<td>Female</td>
<td>6</td>
<td>(12)</td>
<td>7</td>
<td>(22)</td>
<td></td>
</tr>
<tr>
<td colspan="7">
<italic>Tumour characteristics</italic>
</td>
</tr>
<tr>
<td rowspan="2"> Diagnosis</td>
<td>Squamous cell carcinoma</td>
<td>49</td>
<td>(98)</td>
<td>30</td>
<td>(94)</td>
<td>0.28</td>
</tr>
<tr>
<td>Other tumour types</td>
<td>1</td>
<td>(2)</td>
<td>2</td>
<td>(6)</td>
<td></td>
</tr>
<tr>
<td rowspan="3"> Laterality</td>
<td>Right</td>
<td>20</td>
<td>(61)</td>
<td>4</td>
<td>(40)</td>
<td>0.37</td>
</tr>
<tr>
<td>Left</td>
<td>12</td>
<td>(36)</td>
<td>6</td>
<td>(60)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
<sup>
<italic>b</italic>
</sup>
</td>
<td>17</td>
<td></td>
<td>22</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="8"> Site of origin</td>
<td>Oropharynx</td>
<td>42</td>
<td>(89)</td>
<td>14</td>
<td>(48)</td>
<td><0.01
<sup>
<italic>c</italic>
</sup>
</td>
</tr>
<tr>
<td>Skin</td>
<td>2</td>
<td>(4)</td>
<td>3</td>
<td>(10)</td>
<td></td>
</tr>
<tr>
<td>Larynx</td>
<td>0</td>
<td>(0)</td>
<td>9</td>
<td>(31)</td>
<td></td>
</tr>
<tr>
<td>Lip</td>
<td>1</td>
<td>(2)</td>
<td>2</td>
<td>(7)</td>
<td></td>
</tr>
<tr>
<td>Nasal cavity</td>
<td>1</td>
<td>(2)</td>
<td>0</td>
<td>(0)</td>
<td></td>
</tr>
<tr>
<td>Nasopharynx</td>
<td>1</td>
<td>(2)</td>
<td>0</td>
<td>(0)</td>
<td></td>
</tr>
<tr>
<td>Salivary gland</td>
<td>0</td>
<td>(0)</td>
<td>1</td>
<td>(3)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>3</td>
<td></td>
<td>3</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="3"> Recurrent disease</td>
<td>Yes</td>
<td>20</td>
<td>(43)</td>
<td>14</td>
<td>(45)</td>
<td>1</td>
</tr>
<tr>
<td>No</td>
<td>26</td>
<td>(57)</td>
<td>17</td>
<td>(55)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>4</td>
<td></td>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="7">
<italic>Anatomical stage</italic>
</td>
</tr>
<tr>
<td rowspan="6"> T category</td>
<td>T1</td>
<td>11</td>
<td>(23)</td>
<td>7</td>
<td>(23)</td>
<td>0.52</td>
</tr>
<tr>
<td>T2</td>
<td>14</td>
<td>(29)</td>
<td>5</td>
<td>(16)</td>
<td></td>
</tr>
<tr>
<td>T3</td>
<td>14</td>
<td>(29)</td>
<td>10</td>
<td>(32)</td>
<td></td>
</tr>
<tr>
<td>T4</td>
<td>3</td>
<td>(6)</td>
<td>5</td>
<td>(16)</td>
<td></td>
</tr>
<tr>
<td>Tx</td>
<td>6</td>
<td>(12)</td>
<td>4</td>
<td>(13)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>2</td>
<td></td>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="9"> N category</td>
<td>N0</td>
<td>10</td>
<td>(21)</td>
<td>11</td>
<td>(35)</td>
<td>0.35</td>
</tr>
<tr>
<td>N1</td>
<td>9</td>
<td>(19)</td>
<td>6</td>
<td>(19)</td>
<td></td>
</tr>
<tr>
<td>N2, nos</td>
<td>3</td>
<td>(6)</td>
<td>1</td>
<td>(3)</td>
<td></td>
</tr>
<tr>
<td>N2a</td>
<td>7</td>
<td>(15)</td>
<td>1</td>
<td>(3)</td>
<td></td>
</tr>
<tr>
<td>N2b</td>
<td>11</td>
<td>(23)</td>
<td>7</td>
<td>(23)</td>
<td></td>
</tr>
<tr>
<td>N2c</td>
<td>7</td>
<td>(15)</td>
<td>2</td>
<td>(6)</td>
<td></td>
</tr>
<tr>
<td>N3</td>
<td>0</td>
<td>(0)</td>
<td>1</td>
<td>(3)</td>
<td></td>
</tr>
<tr>
<td>Nx</td>
<td>1</td>
<td>(2)</td>
<td>2</td>
<td>(6)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>2</td>
<td></td>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="4"> M category</td>
<td>M0</td>
<td>43</td>
<td>(90)</td>
<td>28</td>
<td>(90)</td>
<td>0.39</td>
</tr>
<tr>
<td>M1</td>
<td>0</td>
<td>(0)</td>
<td>1</td>
<td>(3)</td>
<td></td>
</tr>
<tr>
<td>Mx</td>
<td>5</td>
<td>(10)</td>
<td>2</td>
<td>(6)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>2</td>
<td></td>
<td>1</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="5">TNM Stage (7th edition)</td>
<td>I</td>
<td>2</td>
<td>(4)</td>
<td>5</td>
<td>(17)</td>
<td>0.17</td>
</tr>
<tr>
<td>II</td>
<td>2</td>
<td>(4)</td>
<td>2</td>
<td>(7)</td>
<td></td>
</tr>
<tr>
<td>III</td>
<td>13</td>
<td>(27)</td>
<td>7</td>
<td>(23)</td>
<td></td>
</tr>
<tr>
<td>IV</td>
<td>31</td>
<td>(65)</td>
<td>16</td>
<td>(53)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>2</td>
<td></td>
<td>2</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="7">
<italic>Smoking status</italic>
</td>
</tr>
<tr>
<td rowspan="3"> Ever smoked</td>
<td>Yes</td>
<td>22</td>
<td>(56)</td>
<td>20</td>
<td>(74)</td>
<td>0.23</td>
</tr>
<tr>
<td>No</td>
<td>17</td>
<td>(44)</td>
<td>7</td>
<td>(26)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>11</td>
<td></td>
<td>5</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="2"> Smoking history</td>
<td>Median (pack-years)</td>
<td>0</td>
<td>(IQR: 0–27.5)</td>
<td>25</td>
<td>(IQR: 0–50)</td>
<td>0.02</td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>19</td>
<td></td>
<td>8</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="3"> Current smoker</td>
<td>Yes</td>
<td>11</td>
<td>(28)</td>
<td>10</td>
<td>(37)</td>
<td>0.625</td>
</tr>
<tr>
<td>No</td>
<td>28</td>
<td>(72)</td>
<td>17</td>
<td>(63)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>11</td>
<td></td>
<td>5</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="2"> Current amount</td>
<td>Median</td>
<td>0</td>
<td>(IQR: 0–0)</td>
<td>10</td>
<td>(IQR: 0–22.5)</td>
<td>0.17</td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>28</td>
<td></td>
<td>17</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="2"> Last smoked</td>
<td>Median (years ago)</td>
<td>1.12</td>
<td>(IQR: 0.812–3.19)</td>
<td>21</td>
<td>(IQR: 18.5–24)</td>
<td>0.02</td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>50</td>
<td></td>
<td>26</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="7">
<italic>Alcohol use</italic>
</td>
</tr>
<tr>
<td rowspan="3"> Ever consumed</td>
<td>Yes</td>
<td>27</td>
<td>(82)</td>
<td>21</td>
<td>(84)</td>
<td>1</td>
</tr>
<tr>
<td>No</td>
<td>6</td>
<td>(18)</td>
<td>4</td>
<td>(16)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>17</td>
<td></td>
<td>7</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="3"> Current drinker</td>
<td>Yes</td>
<td>23</td>
<td>(70)</td>
<td>18</td>
<td>(72)</td>
<td>1</td>
</tr>
<tr>
<td>No</td>
<td>10</td>
<td>(30)</td>
<td>7</td>
<td>(28)</td>
<td></td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>17</td>
<td></td>
<td>7</td>
<td></td>
<td></td>
</tr>
<tr>
<td rowspan="2"> Current amount</td>
<td>Median (grams/day)</td>
<td>60</td>
<td>(IQR: 20–80)</td>
<td>40</td>
<td>(IQR: 20–80)</td>
<td>0.70</td>
</tr>
<tr>
<td>
<italic>Not specified</italic>
</td>
<td>23</td>
<td></td>
<td>11</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>NB: IQR: Inter-quartile range; (a) Fisher’s exact test was used for hypothesis testing on categorical and binary data. Shapiro-Wilk test was used to determine the normality for numeric data. One-way Analysis of Variance (ANOVA) and Kruskal-Wallis tests were used to determine the difference between means (normally-distributed) and median (non-normally distributed) data respectively. (b) Significant between-group difference (p < 0.05) on the number of missing values (c) Statistically significant at α = 0.01.</p>
</table-wrap-foot>
</table-wrap>
<fig id="Fig3">
<label>Figure 3</label>
<caption>
<p>Flowchart of data analysis of the validation dataset.</p>
</caption>
<graphic xlink:href="41598_2017_7111_Fig3_HTML" id="d29e1910"></graphic>
</fig>
</p>
<p id="Par26">The discovery corpus consisted of five types of clinical text: (1) MDT meeting reports (N = 77), (2) correspondence from the medical oncology clinic (N = 14), (3) anatomical pathology reports (N = 75), (4) radiology reports of 18F-fluorodeoxyglucose Positron Emission Tomography/Computed Tomography (FDG-PET/CT, N = 74), and (5) all of the above clinical text (N = 82) together with other non-cancer-specific EMR (correspondence from other specialties, non-oncology radiology reports, and administrative records).</p>
</sec>
<sec id="Sec20">
<title>Qualitative analyses of text features associated with HPV/P16 status in HNSCC patients</title>
<sec id="Sec21">
<title>Exploratory analysis of MDT meeting reports</title>
<p id="Par27">The top binary feature (text fragment) associated with HPV-related HNSCC was “
<italic>base of</italic>
” (OR: 10.5, p = 4.1 × 10
<sup>−5</sup>
, pattern S2a.1) which was part of the phrase “
<italic>base of tongue</italic>
”. This was followed by “
<italic>the right tonsil</italic>
” (OR: 22.9, p = 0.0012, S2a.2), “
<italic>M0</italic>
,” (OR: 20.5, p = 0.0023, S2a.3), and “
<italic>positive</italic>
” (OR: 5.7, p = 0.0029, S2a.4), which were indicative of disease site, stage, and part of phrase “
<italic>HPV/P16 positive</italic>
” respectively. The full list of patterns is described in Table 
<xref rid="MOESM1" ref-type="media">S2a</xref>
.</p>
<p id="Par28">After application of the regular expression induction algorithm, the list became more informative. Regular expressions describing the site of disease (e.g. “
<italic>the</italic>
(
<italic>right|left</italic>
)
<italic>? base of tongue</italic>
”, S2c.2 and “
<italic>SCC of the right</italic>
(
<italic>tonsil|base of tongue|glossotonsillar sulcus</italic>
)
<italic>-</italic>
”, S2c.9), treatment modality (S2c.7), and HPV/P16 status (S2c.17) were discovered. A phrase describing the most likely disease stage in HPV-related cases (“(
<italic>T3 N2c|T1 N2b|cT1 N2a</italic>
)
<italic>M0</italic>
”, i.e. non-metastatic disease with low T- but high N-stage) was identified at a more liberal filtering threshold (OR: 11.9, p = 0.038).</p>
<p id="Par29">Text features associated with HPV-unrelated disease were also extractable from the MDT meeting reports (Table 
<xref rid="MOESM1" ref-type="media">S2e–h</xref>
). At first glance, the majority of unigrams were not readily interpretable. However, a close examination of the corpus text showed that these words were either part of a conserved expression or words embedded within a group of concepts. For instance, the word “
<italic>management</italic>
” (S2e.1) referred to a number of phrases describing upfront surgery (e.g. “
<italic>Initial management will require</italic>
<italic>dissection</italic>
”, “
<italic>Initial management</italic>
<italic>surgical</italic>
”, 4 of 7 cases). The word “
<italic>than</italic>
” was associated with the concept of ever-consumed alcohol (part of “
<italic>consumed less/more than x gram of alcohol</italic>
”, S2e.2). The fragment “
<italic>disease with</italic>
” (S2e.6) was part of phrases “
<italic>ischaemic heart disease with</italic>
…” (N = 4) and “
<italic>peripheral vascular disease with</italic>
…” (N = 2), indicating a composite concept of advanced atherosclerotic disease. Again, the induction of regular expression produced more interpretable concepts than simple
<italic>n</italic>
-gram fragments (Table 
<xref rid="MOESM1" ref-type="media">S2f and g</xref>
).</p>
<p id="Par30">The volcano plot is shown in Fig. 
<xref rid="Fig4" ref-type="fig">4</xref>
, and a list of informative patterns is summarised in Table 
<xref rid="Tab2" ref-type="table">2</xref>
.
<fig id="Fig4">
<label>Figure 4</label>
<caption>
<p>Volcano plot showing the ranking text features associated with HPV status discovered from the HNSCC MDT reports. Note: Labels of patterns with p < 0.002 are shown in this plot. Legend: ◆: regular expression. ∙: n-gram text fragments. The pattern of regular expression “(
<italic>A|B</italic>
)” indicates either
<italic>A</italic>
or
<italic>B</italic>
would match the string, and “?” indicates an optional element. The size of each diamond or circle is proportional to the total number of cases mentioning the text pattern in the EMR.</p>
</caption>
<graphic xlink:href="41598_2017_7111_Fig4_HTML" id="d29e2042"></graphic>
</fig>
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Informative features associated with HNSCC by HPV status as discovered by TEPAPA.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th>Log (OR)</th>
<th>P</th>
<th>N</th>
<th>Text feature</th>
<th>Type</th>
<th>EMR Source</th>
<th>Interpretation</th>
<th>Crossref.</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="8">
<bold>Informative features associated with HPV-related HNSCC</bold>
</td>
</tr>
<tr>
<td>3.50</td>
<td>3.0 × 10
<sup>−6</sup>
</td>
<td>25</td>
<td>“HPV (studies|genotypes|status):? P16 immunohistochemistry:? Positive”</td>
<td>R</td>
<td>Pathology</td>
<td>HPV status (Self-referent)</td>
<td>(S3c.1)</td>
</tr>
<tr>
<td>3.89</td>
<td>6.2 × 10
<sup>−6</sup>
</td>
<td>20</td>
<td>“HPV (positive|genotypes: Positive|associated squamous cell carcinoma|related).”</td>
<td>R</td>
<td>Pathology</td>
<td>HPV status (Self-referent)</td>
<td>(S3c.2)</td>
</tr>
<tr>
<td>3.29</td>
<td>2.0 × 10
<sup>−5</sup>
</td>
<td>23</td>
<td>“No FDG avid? pulmonary (nodules|nodule) or pleural”</td>
<td>R</td>
<td>PET</td>
<td>(Lack of) metastasis to the lung</td>
<td>(S4c.1)</td>
</tr>
<tr>
<td>3.14</td>
<td>5.6 × 10
<sup>−5</sup>
</td>
<td>21</td>
<td>“HPV related”</td>
<td>N</td>
<td>Pathology</td>
<td>HPV status (Self-referent)</td>
<td>(S3b.6)</td>
</tr>
<tr>
<td>2.06</td>
<td>0.00094</td>
<td>24</td>
<td>“irradiation (and|with) (or without|concurrent) chemotherapy”</td>
<td>R</td>
<td>MDT</td>
<td>Management</td>
<td>(S2c.7)</td>
</tr>
<tr>
<td>2.76</td>
<td>0.0093</td>
<td>9</td>
<td>“oropharyngectomy:”</td>
<td>N</td>
<td>Pathology</td>
<td>Management, site of primary tumor</td>
<td>(S3a.22)</td>
</tr>
<tr>
<td>3.23</td>
<td>0.0011</td>
<td>13</td>
<td>“SCC of the right (tonsil|base of tongue|glossotonsillar sulcus) -”</td>
<td>R</td>
<td>MDT</td>
<td>Site of primary tumor</td>
<td>(S2d.4)</td>
</tr>
<tr>
<td>2.68</td>
<td>0.0015</td>
<td>16</td>
<td>“SCC of the (right|left)? base of tongue”</td>
<td>R</td>
<td>MDT</td>
<td>Site of primary tumor</td>
<td>(S2d.5)</td>
</tr>
<tr>
<td>3.02</td>
<td>0.0023</td>
<td>11</td>
<td>“M0”</td>
<td>N</td>
<td>MDT</td>
<td>Stage</td>
<td>(S2a.3)</td>
</tr>
<tr>
<td>2.89</td>
<td>0.0047</td>
<td>10</td>
<td>“non-keratinising”</td>
<td>N</td>
<td>Pathology</td>
<td>Pathology feature</td>
<td>(S3a.16)</td>
</tr>
<tr>
<td>2.77</td>
<td>0.0092</td>
<td>9</td>
<td>“p16? positive,? HPV? positive”</td>
<td>R</td>
<td>MDT</td>
<td>HPV status (Self-referent)</td>
<td>(S2c.17)</td>
</tr>
<tr>
<td colspan="8">
<bold>Informative features associated with HPV-unrelated HNSCC</bold>
</td>
</tr>
<tr>
<td>−3.54</td>
<td>0.00035</td>
<td>8</td>
<td>“for decalcification”</td>
<td>N</td>
<td>Pathology</td>
<td>Pathology feature</td>
<td>(S3e.2)</td>
</tr>
<tr>
<td>−2.91</td>
<td>0.00089</td>
<td>10</td>
<td>“a (locally|locoregionally)? (p16 negative|advanced) SCC”</td>
<td>R</td>
<td>MDT</td>
<td>HPV status and pathology feature</td>
<td>(S2h.3)</td>
</tr>
<tr>
<td>−3.17</td>
<td>0.0031</td>
<td>6</td>
<td>“SCC of the supraglottic? (lower lip|larynx).”</td>
<td>R</td>
<td>MDT</td>
<td>Site of primary tumor</td>
<td>(S2g.7)</td>
</tr>
<tr>
<td>−2.96</td>
<td>0.0086</td>
<td>5</td>
<td>“likely to? require adjuvant radiation therapy”</td>
<td>R</td>
<td>MDT</td>
<td>Management</td>
<td>(S2g.10)</td>
</tr>
<tr>
<td>−3.35</td>
<td>0.0011</td>
<td>7</td>
<td>“supportive care”</td>
<td>N</td>
<td>MDT</td>
<td>Management</td>
<td>(S2f.3)</td>
</tr>
<tr>
<td>−2.59</td>
<td>0.0058</td>
<td>8</td>
<td>“differentiated, keratinising squamous cell carcinoma”</td>
<td>N</td>
<td>Pathology</td>
<td>Pathology feature</td>
<td>(S3e.23)</td>
</tr>
<tr>
<td>−2.59</td>
<td>0.0058</td>
<td>8</td>
<td>“well differentiated”</td>
<td>N</td>
<td>Pathology</td>
<td>Pathology feature</td>
<td>(S3e.26)</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Note: The type field indicates the type of text features (N: n-gram fragments or R: regular expression). N indicates number of documents containing the text features. Abbreviations: Log (OR): Log odds ratio. MDT: Multidisciplinary team meeting.</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="Sec22">
<title>Exploratory analysis on other sub-corpora</title>
<p id="Par31">The analysis of pathology reports identified text fragments describing the results of the HPV/P16 assay as the top-ranking features (“:
<italic>Positive</italic>
” and “
<italic>: Negative</italic>
”, S3a.1 and S3e.1), among other relevant factors (Tables 
<xref rid="Tab2" ref-type="table">2</xref>
and
<xref rid="MOESM1" ref-type="media">S3</xref>
). Likewise, the sites of primary tumour (S3a.2–4) and the associated concepts (e.g. “
<italic>for decalcification</italic>
”, S3e.2, indicating the need to process bony surgical specimens for microscopic examination, and thus a tumour less likely to be at an oropharyngeal site) were also identified. In the FDG-PET/CT reports, we found conflicting results for abnormal pulmonary nodules, where phrases describing both the presence and the absence of lung metastasis were found (e.g. S4c.1 and S4c.7). Further examination of the EMR text showed that the negation qualifiers were not captured owing to lexical variation (e.g. “not” or “no evidence of”), and that the negative concepts appeared to be more conserved in their expression. An analysis of oncology correspondence did not yield statistically significant entries at α = 0.025.</p>
</sec>
</sec>
<sec id="Sec23">
<title>Qualitative comparison of discovered concepts with epidemiological literature</title>
<p id="Par32">A practical measure of quality of discovery is to compare the algorithmically discovered concepts against published literature (Table 
<xref rid="Tab3" ref-type="table">3</xref>
). In this analysis, our pipeline consistently recovered concepts associated with primary tumour site, the commonest anatomical staging at presentation, and the primary treatment modality in association with a patient’s HPV status. Indirect associations with cigarette and alcohol exposure and with cardiovascular comorbidities were also recovered. From the pathology reports, TEPAPA identified histological grade, non-keratinising epithelium, morphology, and lack of epithelial dysplasia as features correlated with HPV-related disease. While patients with HPV-related disease are known to have a more favourable prognosis
<sup>
<xref ref-type="bibr" rid="CR36">36</xref>
</sup>
, survival data was not available for examination. Sexual and marijuana history were not recorded in the EMR, and comorbidities were also inconsistently documented.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Literature-based comparison of features associated with HNSCC by HPV status.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="2">Variables</th>
<th colspan="2">HPV status</th>
<th>Examples of highly-ranked, informative features</th>
<th rowspan="2">Reference</th>
</tr>
<tr>
<th>HPV-related</th>
<th>HPV-unrelated</th>
<th>Log(OR), P-value (Crossref.)</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="5">
<bold>Demographics</bold>
</td>
</tr>
<tr>
<td>Age</td>
<td>Younger</td>
<td>Older</td>
<td>(
<italic>Not identified</italic>
)</td>
<td>
<xref ref-type="bibr" rid="CR24">24</xref>
</td>
</tr>
<tr>
<td>Married</td>
<td>Associated</td>
<td>NS</td>
<td>(
<italic>Not consistently documented in EMR</italic>
)</td>
<td>
<xref ref-type="bibr" rid="CR25">25</xref>
</td>
</tr>
<tr>
<td colspan="5">
<bold>Social History</bold>
</td>
</tr>
<tr>
<td>Cigarette and alcohol exposure</td>
<td>Associated</td>
<td>Strongly associated</td>
<td>“
<bold>than</bold>
”* Log(OR) = −2.75, P = 0.0024 (S2e.2)</td>
<td>
<xref ref-type="bibr" rid="CR24">24</xref>
–
<xref ref-type="bibr" rid="CR26">26</xref>
</td>
</tr>
<tr>
<td>Marijuana use</td>
<td>Associated</td>
<td>Associated</td>
<td>(
<italic>Not documented in EMR</italic>
)</td>
<td>
<xref ref-type="bibr" rid="CR25">25</xref>
</td>
</tr>
<tr>
<td>Poor oral hygiene (incl. tooth loss)</td>
<td>Not associated</td>
<td>Associated</td>
<td>“
<bold>is (…|a restored dentition|…|edentulous|…)</bold>
.”
<sup></sup>
Log(OR) = −1.43, P = 0.0051 (S2h.7)</td>
<td>
<xref ref-type="bibr" rid="CR25">25</xref>
,
<xref ref-type="bibr" rid="CR26">26</xref>
</td>
</tr>
<tr>
<td colspan="5">
<bold>Sexual history</bold>
</td>
</tr>
<tr>
<td>Oral sex partners</td>
<td>Associated</td>
<td>NS</td>
<td>(
<italic>Not documented in EMR</italic>
)</td>
<td>
<xref ref-type="bibr" rid="CR24">24</xref>
–
<xref ref-type="bibr" rid="CR27">27</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
</td>
</tr>
<tr>
<td>Number of lifetime sexual partners</td>
<td>Associated</td>
<td>NS</td>
<td>(
<italic>Not documented in EMR</italic>
)</td>
<td>
<xref ref-type="bibr" rid="CR25">25</xref>
,
<xref ref-type="bibr" rid="CR27">27</xref>
,
<xref ref-type="bibr" rid="CR29">29</xref>
</td>
</tr>
<tr>
<td colspan="5">
<bold>Comorbidities</bold>
</td>
</tr>
<tr>
<td>Cardiovascular</td>
<td>Risk factors (e.g. Hypertension)</td>
<td>Macrovascular atherosclerotic disease</td>
<td>“
<bold>disease with</bold>
” Log(OR) = −3.2, P = 0.0031 (S2e.6)*</td>
<td>
<xref ref-type="bibr" rid="CR25">25</xref>
</td>
</tr>
<tr>
<td>Primary tumor site</td>
<td>Oropharynx</td>
<td>Non-oropharynx</td>
<td>“
<bold>SCC of the right (tonsil|base of tongue|glossotonsillar sulcus)</bold>
” Log(OR) = 3.23, P = 0.0011 (S2d.4); “
<bold>SCC of the (right|left)? base of tongue</bold>
” Log(OR) = 2.68, P = 0.0015 (S2d.5)</td>
<td>
<xref ref-type="bibr" rid="CR24">24</xref>
,
<xref ref-type="bibr" rid="CR26">26</xref>
,
<xref ref-type="bibr" rid="CR28">28</xref>
,
<xref ref-type="bibr" rid="CR30">30</xref>
–
<xref ref-type="bibr" rid="CR32">32</xref>
</td>
</tr>
<tr>
<td colspan="5">
<bold>Anatomical stage</bold>
</td>
</tr>
<tr>
<td>T stage</td>
<td>Early T-stage</td>
<td rowspan="2"></td>
<td rowspan="2">“
<bold>M0</bold>
,” Log(OR) = 3.02, P = 0.002 (S2a.3); “
<bold>((T3 N2c)|(T1 N2b)|(cT1 N2a)) M0</bold>
” Log(OR) = 2.48, P = 0.038; “
<bold>a large single lymph node exhibiting metastatic cystic? moderately differentiated non-keratinising? squamous cell carcinoma</bold>
” Log(OR) = 2.89, P = 0.0047 (S3d.53)</td>
<td>
<xref ref-type="bibr" rid="CR33">33</xref>
</td>
</tr>
<tr>
<td>Nodal status</td>
<td>Multilevel, “High N-stage” Cystic nodes</td>
<td>
<xref ref-type="bibr" rid="CR24">24</xref>
,
<xref ref-type="bibr" rid="CR30">30</xref>
,
<xref ref-type="bibr" rid="CR33">33</xref>
,
<xref ref-type="bibr" rid="CR34">34</xref>
</td>
</tr>
<tr>
<td colspan="5">
<bold>Pathology features</bold>
</td>
</tr>
<tr>
<td>Grade</td>
<td>Moderately to poorly differentiated</td>
<td>Moderately differentiated</td>
<td rowspan="3">“
<bold>non-keratinising</bold>
” Log(OR) = 2.9, P = 0.0047 (S3a.16); “(
<bold>poorly differentiated|non-keratinizing|non-keratinising|focally keratinizing)? squamous cell carcinoma</bold>
” Log(OR) = 2.67, P = 0.0015 (S3c.35); “
<bold>of (…|basaloid type/Non-keratinizing|…)</bold>
.”
<sup></sup>
Log(OR) = 2.87, P = 0.00062 (S3c.19); “
<bold>with (high|low)? (grade|mild) dysplasia</bold>
” Log(OR) = −3.37, P = 0.001 (S3g.11)</td>
<td>
<xref ref-type="bibr" rid="CR26">26</xref>
,
<xref ref-type="bibr" rid="CR30">30</xref>
</td>
</tr>
<tr>
<td>Keratinisation</td>
<td>Absent</td>
<td>Present</td>
<td>
<xref ref-type="bibr" rid="CR26">26</xref>
</td>
</tr>
<tr>
<td>Other features</td>
<td>Basaloid morphology</td>
<td>Epithelial dysplasia</td>
<td>
<xref ref-type="bibr" rid="CR26">26</xref>
,
<xref ref-type="bibr" rid="CR28">28</xref>
,
<xref ref-type="bibr" rid="CR32">32</xref>
</td>
</tr>
<tr>
<td colspan="5">
<bold>Management</bold>
</td>
</tr>
<tr>
<td>Locally advanced disease (T3/4 or N2/3)</td>
<td>Surgery + adjuvant radiotherapy +/− concurrent chemotherapy</td>
<td></td>
<td>“
<bold>irradiation (and|with) (or without|concurrent) chemotherapy</bold>
” Log(OR) = 2.06, P = 0.00094 (S2c.7)</td>
<td>
<xref ref-type="bibr" rid="CR35">35</xref>
</td>
</tr>
<tr>
<td colspan="5">
<bold>Treatment outcome</bold>
</td>
</tr>
<tr>
<td>Overall survival</td>
<td>Better prognosis</td>
<td>Poorer prognosis</td>
<td>(
<italic>Not assessable by this dataset</italic>
)</td>
<td>
<xref ref-type="bibr" rid="CR36">36</xref>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Abbreviations: NS, not significant; Log(OR), log odds ratio. *Refers to part of “
<italic>consumed</italic>
(
<italic>greater|less</italic>
)
<italic>than</italic>
”, a phrase used to describe “ever-consumption of alcohol”. For entries S2h.7 and S3c.19, the index concept was revealed only through “overfitting” the concept to a regular expression pattern flanked by two tokens. See main text for detailed discussion.</p>
</table-wrap-foot>
</table-wrap>
</p>
<p id="Par33">We have found that the regular expression induction algorithm can meaningfully group closely related concepts together if they are flanked by a pair of highly specific phrases (e.g. “
<italic>SCC of</italic>
<italic>base of tongue</italic>
”, S2d.5), but less so if the flanking texts are made up of common words. For instance, the concepts related to poor oral hygiene (“
<italic>restored dentition</italic>
” and “
<italic>edentulous</italic>
”) were admixed with unrelated concepts (S2h.7) as a result of overfitting the training data to the non-specific text pattern “
<italic>of</italic>
…. ”.</p>
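<p>The grouping behaviour described above can be illustrated with a minimal sketch. It is not the TEPAPA induction algorithm: the helper name induce_pattern and the example fragments are hypothetical, and the sketch simply merges fragments that share the same leading and trailing phrases into a single alternation.</p>

```python
import re


def induce_pattern(fragments, prefix, suffix):
    """Merge fragments that share `prefix` and `suffix` into one regex.

    Hypothetical helper for illustration only; not TEPAPA's algorithm.
    """
    middles = []
    for text in fragments:
        if not (text.startswith(prefix) and text.endswith(suffix)):
            raise ValueError(f"fragment lacks the flanking phrases: {text!r}")
        # Whatever sits between the two flanking phrases becomes one alternative.
        middle = text[len(prefix):len(text) - len(suffix)].strip()
        if middle not in middles:
            middles.append(middle)
    alternatives = "|".join(re.escape(m) for m in middles if m)
    if alternatives:
        core = "(?:(?:" + alternatives + ") )"
        if "" in middles:
            core += "?"  # some fragment had nothing between the flanks
    else:
        core = ""
    return re.compile(re.escape(prefix) + " " + core + re.escape(suffix))


pattern = induce_pattern(
    [
        "SCC of the right base of tongue",
        "SCC of the left base of tongue",
        "SCC of the base of tongue",
    ],
    prefix="SCC of the",
    suffix="base of tongue",
)
```

<p>With specific flanking phrases, the alternatives stay semantically close (here, laterality). With flanks made of common words, any text between them would be admitted, which is the admixture effect noted for S2h.7.</p>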
</sec>
<sec id="Sec24">
<title>Binary and numeric features associated with other clinical variables</title>
<p id="Par34">Exploratory analyses of other clinicopathologic variables were performed to demonstrate the generalisability of the method (Table 
<xref rid="MOESM1" ref-type="media">S5</xref>
). The pipeline found that the phrases “
<italic>He</italic>
” (p = 1.7 × 10
<sup>−13</sup>
, S5.1) and “
<italic>She is</italic>
” (p = 1.6 × 10
<sup>−15</sup>
, S5.3) were associated with the patient’s gender. The age of the patient was associated with mentions of “
<italic>chronic</italic>
” (AUC: 0.75, p = 0.00011, S5.7) and “
<italic>retired</italic>
” (AUC: 0.76, p = 0.00025, S5.8). Elderly patients were more likely to have a chest X-ray performed with an anterior-posterior projection (AUC: 0.85, p = 7.6 × 10
<sup>−6</sup>
, S5.6), suggesting a more complicated post-operative course in this population. Descriptors of recurrent cases were recovered (S5.18–21). Regular expressions associated with nodal status were also identified; these were explained by phrases summarising extra-nodal spread (S5.13 and S5.15), nodal stage (S5.14), and the phrase “
<italic>there is no lymphadenopathy</italic>
” (S5.17). A conserved regular expression associated with smoking status was found (e.g., “
<italic>a cigarette/heavy/current smoker</italic>
”, S5.29). Ever-smokers were characterised by the regular expression “
<italic>a</italic>
(
<italic>reformed</italic>
)
<italic>? cigarette/heavy/current smoker</italic>
” (the question mark denotes an optional word, S5.36). Current users of alcohol were associated with the use of a quantification phrase “
<italic>g of alcohol daily</italic>
” (S5.38). Phrases associated with patients who have never consumed alcohol have also been identified (S5.39).</p>
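<p>As a minimal sketch of how a phrase can be tested against a binary variable, the code below builds a 2×2 table of phrase presence by outcome and reports a log odds ratio and a two-sided Fisher's exact p-value. The choice of Fisher's exact test, the 0.5 continuity correction, the function names, and the toy texts are assumptions made for illustration; they are not taken from the TEPAPA source.</p>

```python
from math import comb, log


def contingency(texts, labels, phrase):
    """Count phrase presence against a binary label; returns (a, b, c, d)."""
    a = sum(1 for t, y in zip(texts, labels) if phrase in t and y)
    b = sum(1 for t, y in zip(texts, labels) if phrase in t and not y)
    c = sum(1 for t, y in zip(texts, labels) if phrase not in t and y)
    d = sum(1 for t, y in zip(texts, labels) if phrase not in t and not y)
    return a, b, c, d


def log_odds_ratio(a, b, c, d):
    """Log odds ratio for [[a, b], [c, d]] with 0.5 added to every cell.

    The correction (an assumption here) keeps the estimate finite
    when a cell count is zero.
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return log((a * d) / (b * c))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(
        prob(x) for x in range(lowest, highest + 1)
        if p_obs * (1 + 1e-9) >= prob(x)
    )
    return min(1.0, total)


# Toy, made-up notes; True marks a female patient.
texts = ["She is well", "He is well", "She is unwell", "He smokes"]
labels = [True, False, True, False]
table = contingency(texts, labels, "She is")
```

<p>Plain substring matching is used only to keep the sketch short; a tokenised match would avoid accidental hits inside longer words.</p>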
<p id="Par35">The age at diagnosis was perfectly correlated with a structured numeric field in the MDT report recording the patient’s age (p = 1.4 × 10
<sup>−37</sup>
, S6.1). The maximum Standardised Uptake Value (SUVMax) of a lesion on FDG-PET/CT was negatively associated with advanced age (ρ = −0.69, p = 0.00087, S6.3). The amount of alcohol consumed by the patient was also extractable (S6.4). The HPV-related cases were more likely to have higher localised SUVMax values (S6.8). Smoking cessation was associated with the phrase “〈
<italic>number</italic>
〉
<italic>pack</italic>
” (p = 3.3 × 10
<sup>−5</sup>
, S6.6).</p>
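<p>The numeric associations above can be sketched in two steps: pull a number out of the text with a pattern, then rank-correlate it with another variable. The code below is an illustration under assumptions: the pattern, the helper names, and the use of Spearman's ρ with average ranks for ties are choices made here, not details confirmed by the source.</p>

```python
import re


def extract_number(text, pattern):
    """Return the first number captured by group 1 of `pattern`, or None."""
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while len(order) > i:
        j = i
        # Extend j over the run of values tied with position i.
        while len(order) > j + 1 and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical pattern for a number followed by the token "pack".
PACK_PATTERN = r"(\d+(?:\.\d+)?) pack"
packs = extract_number("a 40 pack year history", PACK_PATTERN)
```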
</sec>
<sec id="Sec25">
<title>Phenotyping of HPV/P16 status using features learned from EMR text</title>
<p id="Par36">With all sub-corpora included, the HPV/P16 status could be classified with an overall AUC of 0.861 using EMR narratives alone. While a relationship between the parameters and accuracy was not distinct, the type of text and filtering threshold appeared to be important (Table 
<xref rid="Tab4" ref-type="table">4</xref>
and Figure 
<xref rid="MOESM1" ref-type="media">S8</xref>
). As expected, pathology reports, the sub-corpus most likely to contain HPV/P16 status, performed best among the four sub-corpora. Multiple regression analysis suggested that sequence-level annotation, stemming, and UMLS annotation were more likely to yield improved performance (except for FDG-PET/CT reports). For predictions based on pathology reports, Naive Bayes was numerically superior to ADTree, although in general the performance was comparable across classifiers. Regular expression induction did not improve accuracy in more specialised sub-corpora. The combinatorial search methods (POSTAG and SPARSE) were unable to complete within the predefined resource limit for bootstrap analysis when the entire corpus was used for discovery.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>Predictive performance when varying the annotation method, post-processing, machine learning algorithm, and threshold selection.</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th rowspan="3">Pipeline variations</th>
<th colspan="10">Corpus type</th>
</tr>
<tr>
<th colspan="2">MDT meeting reports (N = 77)</th>
<th colspan="2">Oncology letters (N = 14)</th>
<th colspan="2">Pathology reports (N = 75)</th>
<th colspan="2">FDG-PET/CT reports (N = 74)</th>
<th colspan="2">All inclusive (N = 82)</th>
</tr>
<tr>
<th>Est.</th>
<th>P</th>
<th>Est.</th>
<th>P</th>
<th>Est.</th>
<th>P</th>
<th>Est.</th>
<th>P</th>
<th>Est.</th>
<th>P</th>
</tr>
</thead>
<tbody>
<tr>
<td>Mean (Intercept)</td>
<td>0.634</td>
<td></td>
<td>0.559</td>
<td></td>
<td>0.835</td>
<td></td>
<td>0.759</td>
<td></td>
<td>0.861</td>
<td></td>
</tr>
<tr>
<td colspan="11">Annotation method</td>
</tr>
<tr>
<td> None</td>
<td>(Ref.)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td> POSTAG</td>
<td>0.006</td>
<td>0.13</td>
<td>0.031</td>
<td>&lt;0.001</td>
<td>−0.043</td>
<td>&lt;0.001</td>
<td>−0.062</td>
<td>&lt;0.001</td>
<td>NA</td>
<td></td>
</tr>
<tr>
<td> STEM</td>
<td>0.010</td>
<td>0.009</td>
<td>0.011</td>
<td>0.05</td>
<td>0.005</td>
<td>0.08</td>
<td>0.017</td>
<td>&lt;0.001</td>
<td>0.013</td>
<td>0.059</td>
</tr>
<tr>
<td> SPARSE</td>
<td>−0.017</td>
<td>&lt;0.001</td>
<td>0.056</td>
<td>&lt;0.001</td>
<td>0.004</td>
<td>0.17</td>
<td>−0.005</td>
<td>0.32</td>
<td>NA</td>
<td></td>
</tr>
<tr>
<td> UMLS</td>
<td>0.013</td>
<td>&lt;0.001</td>
<td>0.030</td>
<td>&lt;0.001</td>
<td>0.004</td>
<td>0.17</td>
<td>−0.190</td>
<td>&lt;0.001</td>
<td>0.014</td>
<td>&lt;0.001</td>
</tr>
<tr>
<td colspan="11">Post-processing</td>
</tr>
<tr>
<td> None</td>
<td>(Ref.)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td> REGEXI</td>
<td>−0.003</td>
<td>0.17</td>
<td>0.003</td>
<td>0.44</td>
<td>−0.003</td>
<td>0.09</td>
<td>−0.002</td>
<td>0.50</td>
<td>0.007</td>
<td>0.018</td>
</tr>
<tr>
<td colspan="11">Machine learning algorithm</td>
</tr>
<tr>
<td> ADTree</td>
<td>(Ref.)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td> Logistic regression</td>
<td>−0.0002</td>
<td>0.94</td>
<td>0.015</td>
<td>&lt;0.001</td>
<td>−0.007</td>
<td>0.006</td>
<td>−0.003</td>
<td>0.38</td>
<td>−0.017</td>
<td>&lt;0.001</td>
</tr>
<tr>
<td> Naive Bayes</td>
<td>0.005</td>
<td>0.10</td>
<td>0.018</td>
<td>&lt;0.001</td>
<td>0.018</td>
<td>&lt;0.001</td>
<td>0.003</td>
<td>0.38</td>
<td>0.006</td>
<td>0.126</td>
</tr>
<tr>
<td colspan="11">Threshold selection</td>
</tr>
<tr>
<td> Optimal threshold</td>
<td>(Ref.)</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td> −
<italic>log</italic>
<sub>10</sub>
deviation from the optimal threshold</td>
<td>−0.022</td>
<td>&lt;0.001</td>
<td>−0.040</td>
<td>&lt;0.001</td>
<td>−0.013</td>
<td>&lt;0.001</td>
<td>−0.011</td>
<td>&lt;0.001</td>
<td>0.003</td>
<td>0.15</td>
</tr>
<tr>
<td> 
<italic>Adjusted R</italic>
<sup>
<italic>2</italic>
</sup>
</td>
<td colspan="2">0.40</td>
<td colspan="2">0.65</td>
<td colspan="2">0.66</td>
<td colspan="2">0.85</td>
<td colspan="2">0.72</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Abbreviations: ADTree: alternating decision tree (10 boosting iterations); FDG-PET/CT: 18F-fluorodeoxyglucose positron emission tomography/computed tomography; MDT: multidisciplinary team; POSTAG: part-of-speech tagging with word lemmatization; REGEXI: regular expression induction algorithm; SPARSE: syntactic parsing; STEM: token-level annotation by word stemming using the Snowball algorithm; UMLS: sequence-level annotation using the Metathesaurus from the Unified Medical Language System (UMLS), version 2016AA.</p>
</table-wrap-foot>
</table-wrap>
</p>
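<p>A minimal sketch of a bootstrap estimate of the mean AUC follows. It is a simplification and an assumption: it only resamples pre-computed classifier scores, whereas a full bootstrap analysis would refit the feature discovery and the classifier within each resample. The function names and toy scores are hypothetical.</p>

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive case outranks a negative one.

    Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_mean_auc(scores, labels, n_boot=200, seed=0):
    """Mean AUC over `n_boot` resamples drawn with replacement."""
    if all(labels) or not any(labels):
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while n_boot > len(aucs):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # this resample lost one class; draw again
        aucs.append(auc([scores[i] for i in idx], ys))
    return sum(aucs) / len(aucs)


# Toy, made-up scores that separate the two classes perfectly.
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 0, 0, 0]
mean_auc = bootstrap_mean_auc(toy_scores, toy_labels)
```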
<p id="Par37">Empirically, the computational time was linearly correlated with the corpus size (in characters, r
<sup>2</sup>
 = 0.994, p = 0.0002), conforming to linear time complexity [O(N)]. Annotation with word stemming, part-of-speech tagging, and syntactic parsing generally increased training time, whereas UMLS-based token aggregation generally reduced the computational time (Table 
<xref rid="MOESM1" ref-type="media">S7</xref>
). Variations in the filtering threshold and regular expression induction both produced comparable time usage across different text types.</p>
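<p>The linearity check can be sketched as an ordinary least-squares fit of run time against corpus size, reporting the coefficient of determination. The function name is hypothetical and the numbers below are made up for illustration; they are not the measurements reported in Table S7.</p>

```python
def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus r squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up corpus sizes (characters) and run times (seconds).
sizes = [1, 2, 3, 4]
times = [3, 5, 7, 9]
slope, intercept, r_squared = linear_fit(sizes, times)
```

<p>An r squared close to 1 on such a fit is consistent with linear scaling over the measured range, although it does not by itself establish the asymptotic complexity.</p>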
</sec>
</sec>
<sec id="Sec26" sec-type="discussion">
<title>Discussion</title>
<p id="Par38">The central finding of this research is that clinically relevant associative knowledge is discoverable from EMR text by combining semantic-free NLP methods with association analysis. Our method sensitively identifies key clinicopathologic factors that differentiate subgroups of HNSCC patients by HPV status. Hence, we expect our approach to find useful signals associated with clinical outcomes in other domains. This tool provides an adjunct for efficiently generating new hypotheses to guide downstream investigations of as-yet-unsolved biomedical problems.</p>
<p id="Par39">This work also highlights the possibility of finding plausible associations using only a relatively small cohort of routinely-collected EMR patient data. Most factors associated with virally-implicated HNSCC were found using EMR retrieved from a single site. However, the selection of a relevant corpus appeared to be important; for example, we found no significantly associated factors in the oncology correspondence. This lack of association was not unexpected given the small corpus size and the fact that chemotherapy is only a subsidiary modality for managing non-metastatic HNSCC
<sup>
<xref ref-type="bibr" rid="CR35">35</xref>
</sup>
. Current guidelines also do not yet recommend a different treatment regimen for HPV-related disease, despite speculation about treatment de-intensification in this population
<sup>
<xref ref-type="bibr" rid="CR35">35</xref>
,
<xref ref-type="bibr" rid="CR36">36</xref>
</sup>
.</p>
<p id="Par40">Several strengths of our feature generation and ranking approach suggest useful applications. First, TEPAPA extracts knowledge in the form of clear text and its derivatives, which allows direct transformation of these patterns into searchable formats. The “white-box” approach is advantageous because it allows domain experts to rapidly generate hypotheses and to re-identify contextual information about a case when discrepancies arise, as shown in our analysis. Second, the PatWAS method addresses the “cognitive gaps” which occur at the time of designing an observational study. The unbiased method avoids the problem where a researcher focuses only on a set of familiar variables for testing in an
<italic>ad hoc</italic>
manner, thereby permitting discovery of novel associations. This approach is attractive because most EMR data contain unstructured narratives, and the key concepts may only be described by using non-standardised lexicons. Third, the backbone of our method assumes no underlying knowledge, and thus is expected to work on other biomedical texts, whether formal (e.g. MEDLINE abstracts) or informal (e.g., social network data), to support discovery in distinct settings. Fourth, TEPAPA can find predictive, text-based “informarkers” to allow risk stratification, support
<italic>in silico</italic>
phenotyping tasks, and extract information from EMR. The feasibility of this integrative approach is supported by our predictive analysis.</p>
<p id="Par41">One capability of TEPAPA is to aggregate syntactically similar text fragments into regular expressions to aid data interpretability. In our classification task, however, inclusion of regular expressions did not consistently improve accuracy over that obtained using “bag of token” features alone. Consistent with previous studies, regular expressions provided only a small performance benefit over use of simple word vectors in classification tasks, bearing a weak but correlative trend to the training sample size
<sup>
<xref ref-type="bibr" rid="CR42">42</xref>
,
<xref ref-type="bibr" rid="CR45">45</xref>
</sup>
. Accordingly, methods that aggregate text fragments (as in the induction of regular expressions), although they generate features with better sensitivity (recall), provide little additional information overall when used in conjunction with a multivariate learner for classification and prediction.</p>
<p id="Par42">Although our method appears to provide useful insights into EMR data, the results still need to be scrutinised by domain experts referring back to the original text. To better understand this limitation, we categorised three scenarios of misdiscovery, each of which has a unique characteristic with potential solutions (Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
). Both type I (false positive) and type II (false negative) misdiscoveries can be affected by inappropriate threshold assignment during the feature filtering step. Moreover, type II misdiscovery can result from insufficient information in the corpus. For instance, sexual history was not recorded in our dataset and therefore could not be discovered computationally. Systematic omissions such as this represent an absolute limitation for all types of EMR-based discovery. Type III misdiscovery (wrongly positive) comprises two related subgroups: incorrect qualifier assignment (IIIA) and partially correlated patterns (IIIB). Both problems arise from the algorithm failing to fully examine the underlying semantic structure, resulting in only partial observations. Such misdiscovery represents the ceiling of capability for semantic-free NLP methods, but could be amenable to a richer knowledge representation that incorporates a comprehensive semantic analysis on platforms such as MedLEE
<sup>
<xref ref-type="bibr" rid="CR45">45</xref>
</sup>
and cTAKES
<sup>
<xref ref-type="bibr" rid="CR46">46</xref>
</sup>
during the pre-processing step. A trend was evident from our analysis which suggested that a more sophisticated representation (e.g., regular expression) confers better descriptive power (e.g., versus
<italic>n</italic>
-grams). Incorporating contextual knowledge is thus expected to improve the quality of machine-generated features by considering the linguistic structure more fully.
<fig id="Fig5">
<label>Figure 5</label>
<caption>
<p>Scenarios, examples, and potential sources of misdiscovery.</p>
</caption>
<graphic xlink:href="41598_2017_7111_Fig5_HTML" id="d29e3754"></graphic>
</fig>
</p>
<p id="Par43">Several challenges for future research are clear. First, the optimal method for selecting an objective filtering threshold remains unsolved, as the exhaustive search algorithms necessarily generate patterns that are not independent and identically distributed. As such, the conventional methods for adjusting for multiple hypothesis testing, such as Bonferroni
<sup>
<xref ref-type="bibr" rid="CR47">47</xref>
</sup>
and Benjamini-Hochberg corrections
<sup>
<xref ref-type="bibr" rid="CR48">48</xref>
</sup>
, would be unable to identify a suitable cut-off. Second, as in all high-dimensional analyses, overfitting may occur if a pattern is over-calibrated to fit the training data. Incorporating ensemble selection with early stopping may avoid building an overly complex model
<sup>
<xref ref-type="bibr" rid="CR49">49</xref>
</sup>
. Third, the caveats of epidemiological research (e.g. biases and confounders) still apply, and asking a relevant clinical question remains paramount. Fourth, downstream of plausible text pattern identification, rigorous confirmatory studies remain necessary before drawing a definitive clinical conclusion; EMR-based analyses inherently suffer from bias, noise, missing data, and inconsistency
<sup>
<xref ref-type="bibr" rid="CR50">50</xref>
–
<xref ref-type="bibr" rid="CR53">53</xref>
</sup>
. Fifth, features extracted by TEPAPA are presented in conventional statistical quantities that are widely accepted by the clinical community (e.g., odds ratio, AUC, and p-value). While this application-oriented approach may help to generate new hypotheses for clinical research, alternative feature selection algorithms and regularised variable regression methods (e.g., elastic net)
<sup>
<xref ref-type="bibr" rid="CR54">54</xref>
</sup>
may be better suited to select patterns for building multivariate models for classification. More research is thus needed to identify how to best combine feature generation and selection methods in the context of clinical text classification. Last but not least, meticulous removal of patient identifiers is required to avoid inadvertent breaches of patient privacy, particularly in a data-sharing environment.</p>
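<p>For reference, the two conventional corrections named in this paragraph can be sketched as follows. This is a textbook illustration with hypothetical function names and made-up p-values, not part of TEPAPA; as argued above, neither correction is expected to give a suitable cut-off for exhaustively generated, mutually dependent patterns.</p>

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha / m."""
    m = len(p_values)
    return [alpha / m >= p for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate.

    Finds the largest rank k whose sorted p-value is at most k * alpha / m
    and rejects every hypothesis ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if rank * alpha / m >= p_values[i]:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if cutoff_rank >= rank:
            rejected[i] = True
    return rejected


# Made-up p-values for five candidate patterns.
example_p = [0.001, 0.012, 0.025, 0.039, 0.9]
by_bonferroni = bonferroni(example_p)
by_bh = benjamini_hochberg(example_p)
```

<p>On the same inputs the Benjamini-Hochberg procedure is never more conservative than Bonferroni, which is why the two can select very different numbers of patterns.</p>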
<p id="Par44">In conclusion, we have developed a novel computational pipeline for systematically identifying hitherto-unrecognised covariates from EMR narratives through associative text-mining analyses. Our results support the clinical and translational research use of TEPAPA and its future derivatives in efficiently extracting
<italic>de novo</italic>
knowledge and hypotheses from EMR in the background.</p>
<sec id="Sec27">
<title>Data Availability</title>
<p id="Par45">The source code of TEPAPA can be obtained from http://tepapadiscoverer.org/.</p>
</sec>
</sec>
<sec sec-type="supplementary-material">
<title>Electronic supplementary material</title>
<sec id="Sec28">
<p>
<supplementary-material content-type="local-data" id="MOESM1">
<media xlink:href="41598_2017_7111_MOESM1_ESM.pdf">
<caption>
<p>Supplementary Information</p>
</caption>
</media>
</supplementary-material>
</p>
</sec>
</sec>
</body>
<back>
<fn-group>
<fn>
<p>
<bold>Electronic supplementary material</bold>
</p>
<p>
<bold>Supplementary information</bold>
accompanies this paper at doi:10.1038/s41598-017-07111-0 </p>
</fn>
<fn>
<p>
<bold>Publisher's note:</bold>
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.</p>
</fn>
</fn-group>
<ack>
<title>Acknowledgements</title>
<p>The authors thank Anthony Joshua and Marcel Dinger for critical comments, and John Grygiel for departmental support. An early version of this work was presented as a Scientific Poster at the Sydney Cancer Conference 2016 (22–23 September 2016, Australian Technology Park, Sydney, Australia). FL is supported by John Shine Translational Research Fellowship 2016, Garvan Institute of Medical Research.</p>
</ack>
<notes notes-type="author-contribution">
<title>Author Contributions</title>
<p>F.L. designed the study, programmed the TEPAPA pipeline, performed data analysis, and wrote the initial manuscript. F.L. and C.T. contributed to data collection and literature review. F.L., A.P., and C.T. performed data cleaning and verification. R.E. is the senior author who supervised the study. All authors (F.L., A.P., C.T., and R.E.) contributed to data interpretation and critically revised the manuscript.</p>
</notes>
<notes notes-type="COI-statement">
<sec id="FPar1">
<title>Competing Interests</title>
<p id="Par46">The authors declare that they have no competing interests.</p>
</sec>
</notes>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<label>1.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Frankovich</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Longhurst</surname>
<given-names>CA</given-names>
</name>
<name>
<surname>Sutherland</surname>
<given-names>SM</given-names>
</name>
</person-group>
<article-title>Evidence-based medicine in the EMR era</article-title>
<source>N. Engl. J. Med.</source>
<year>2011</year>
<volume>365</volume>
<fpage>1758</fpage>
<lpage>1759</lpage>
<pub-id pub-id-type="doi">10.1056/NEJMp1108726</pub-id>
<pub-id pub-id-type="pmid">22047518</pub-id>
</element-citation>
</ref>
<ref id="CR2">
<label>2.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zheng</surname>
<given-names>K</given-names>
</name>
<name>
<surname>Mei</surname>
<given-names>Q</given-names>
</name>
<name>
<surname>Hanauer</surname>
<given-names>DA</given-names>
</name>
</person-group>
<article-title>Collaborative search in electronic health records</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2011</year>
<volume>18</volume>
<fpage>282</fpage>
<lpage>291</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2011-000009</pub-id>
<pub-id pub-id-type="pmid">21486887</pub-id>
</element-citation>
</ref>
<ref id="CR3">
<label>3.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kahn</surname>
<given-names>MG</given-names>
</name>
<name>
<surname>Weng</surname>
<given-names>C</given-names>
</name>
</person-group>
<article-title>Clinical research informatics: a conceptual perspective</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2012</year>
<volume>19</volume>
<issue>e1</issue>
<fpage>e36</fpage>
<lpage>42</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2012-000968</pub-id>
<pub-id pub-id-type="pmid">22523344</pub-id>
</element-citation>
</ref>
<ref id="CR4">
<label>4.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chute</surname>
<given-names>CG</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Some experiences and opportunities for big data in translational research</article-title>
<source>Genet. Med.</source>
<year>2013</year>
<volume>15</volume>
<fpage>802</fpage>
<lpage>809</lpage>
<pub-id pub-id-type="doi">10.1038/gim.2013.121</pub-id>
<pub-id pub-id-type="pmid">24008998</pub-id>
</element-citation>
</ref>
<ref id="CR5">
<label>5.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sledge</surname>
<given-names>GW</given-names>
</name>
<etal></etal>
</person-group>
<article-title>ASCO’s approach to a learning health care system in oncology</article-title>
<source>J. Oncol. Pract.</source>
<year>2013</year>
<volume>9</volume>
<fpage>145</fpage>
<lpage>148</lpage>
<pub-id pub-id-type="doi">10.1200/JOP.2013.000957</pub-id>
<pub-id pub-id-type="pmid">23942494</pub-id>
</element-citation>
</ref>
<ref id="CR6">
<label>6.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abernethy</surname>
<given-names>AP</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Rapid-learning system for cancer care</article-title>
<source>J. Clin. Oncol.</source>
<year>2010</year>
<volume>28</volume>
<fpage>4268</fpage>
<lpage>4274</lpage>
<pub-id pub-id-type="doi">10.1200/JCO.2010.28.5478</pub-id>
<pub-id pub-id-type="pmid">20585094</pub-id>
</element-citation>
</ref>
<ref id="CR7">
<label>7.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shrager</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Tenenbaum</surname>
<given-names>JM</given-names>
</name>
</person-group>
<article-title>Rapid learning for precision oncology</article-title>
<source>Nat. Rev. Clin. Oncol.</source>
<year>2014</year>
<volume>11</volume>
<fpage>109</fpage>
<lpage>118</lpage>
<pub-id pub-id-type="doi">10.1038/nrclinonc.2013.244</pub-id>
<pub-id pub-id-type="pmid">24445514</pub-id>
</element-citation>
</ref>
<ref id="CR8">
<label>8.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jensen</surname>
<given-names>PB</given-names>
</name>
<name>
<surname>Jensen</surname>
<given-names>LJ</given-names>
</name>
<name>
<surname>Brunak</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Mining electronic health records: towards better research applications and clinical care</article-title>
<source>Nat. Rev. Genet.</source>
<year>2012</year>
<volume>13</volume>
<fpage>395</fpage>
<lpage>405</lpage>
<pub-id pub-id-type="doi">10.1038/nrg3208</pub-id>
<pub-id pub-id-type="pmid">22549152</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<label>9.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kho</surname>
<given-names>AN</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2012</year>
<volume>19</volume>
<fpage>212</fpage>
<lpage>218</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2011-000439</pub-id>
<pub-id pub-id-type="pmid">22101970</pub-id>
</element-citation>
</ref>
<ref id="CR10">
<label>10.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Warner</surname>
<given-names>JL</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Temporal phenome analysis of a large electronic health record cohort enables identification of hospital-acquired complications</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2013</year>
<volume>20</volume>
<fpage>e281</fpage>
<lpage>e287</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2013-001861</pub-id>
<pub-id pub-id-type="pmid">23907284</pub-id>
</element-citation>
</ref>
<ref id="CR11">
<label>11.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Denny</surname>
<given-names>JC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations</article-title>
<source>Bioinformatics.</source>
<year>2010</year>
<volume>26</volume>
<fpage>1205</fpage>
<lpage>1210</lpage>
<pub-id pub-id-type="doi">10.1093/bioinformatics/btq126</pub-id>
<pub-id pub-id-type="pmid">20335276</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<label>12.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ritchie</surname>
<given-names>MD</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Genome- and phenome-wide analyses of cardiac conduction identifies markers of arrhythmia risk</article-title>
<source>Circulation.</source>
<year>2013</year>
<volume>127</volume>
<fpage>1377</fpage>
<lpage>1385</lpage>
<pub-id pub-id-type="doi">10.1161/CIRCULATIONAHA.112.000604</pub-id>
<pub-id pub-id-type="pmid">23463857</pub-id>
</element-citation>
</ref>
<ref id="CR13">
<label>13.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Denny</surname>
<given-names>JC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data</article-title>
<source>Nat. Biotechnol.</source>
<year>2013</year>
<volume>31</volume>
<fpage>1102</fpage>
<lpage>1110</lpage>
<pub-id pub-id-type="doi">10.1038/nbt.2749</pub-id>
<pub-id pub-id-type="pmid">24270849</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<label>14.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wei</surname>
<given-names>WQ</given-names>
</name>
<name>
<surname>Denny</surname>
<given-names>JC</given-names>
</name>
</person-group>
<article-title>Extracting research-quality phenotypes from electronic health records to support precision medicine</article-title>
<source>Genome Med.</source>
<year>2015</year>
<volume>7</volume>
<fpage>41</fpage>
<pub-id pub-id-type="doi">10.1186/s13073-015-0166-y</pub-id>
<pub-id pub-id-type="pmid">25937834</pub-id>
</element-citation>
</ref>
<ref id="CR15">
<label>15.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kohane</surname>
<given-names>IS</given-names>
</name>
</person-group>
<article-title>Using electronic health records to drive discovery in disease genomics</article-title>
<source>Nat. Rev. Genet.</source>
<year>2011</year>
<volume>12</volume>
<fpage>417</fpage>
<lpage>428</lpage>
<pub-id pub-id-type="doi">10.1038/nrg2999</pub-id>
<pub-id pub-id-type="pmid">21587298</pub-id>
</element-citation>
</ref>
<ref id="CR16">
<label>16.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Denny</surname>
<given-names>JC</given-names>
</name>
</person-group>
<article-title>Chapter 13: Mining electronic health records in the genomics era</article-title>
<source>PLoS Comput. Biol.</source>
<year>2012</year>
<volume>8</volume>
<fpage>e1002823</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1002823</pub-id>
<pub-id pub-id-type="pmid">23300414</pub-id>
</element-citation>
</ref>
<ref id="CR17">
<label>17.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Uzuner</surname>
<given-names>O</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Identifying patient smoking status from medical discharge records</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2008</year>
<volume>15</volume>
<fpage>14</fpage>
<lpage>24</lpage>
<pub-id pub-id-type="doi">10.1197/jamia.M2408</pub-id>
<pub-id pub-id-type="pmid">17947624</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<label>18.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>DeLisle</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Combining free text and structured electronic medical record entries to detect acute respiratory infections</article-title>
<source>PLoS One.</source>
<year>2010</year>
<volume>5</volume>
<fpage>e13377</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pone.0013377</pub-id>
<pub-id pub-id-type="pmid">20976281</pub-id>
</element-citation>
</ref>
<ref id="CR19">
<label>19.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Roque</surname>
<given-names>FS</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Using electronic patient records to discover disease correlations and stratify patient cohorts</article-title>
<source>PLoS Comput. Biol.</source>
<year>2011</year>
<volume>7</volume>
<fpage>e1002141</fpage>
<pub-id pub-id-type="doi">10.1371/journal.pcbi.1002141</pub-id>
<pub-id pub-id-type="pmid">21901084</pub-id>
</element-citation>
</ref>
<ref id="CR20">
<label>20.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kullo</surname>
<given-names>IJ</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2010</year>
<volume>17</volume>
<fpage>568</fpage>
<lpage>574</lpage>
<pub-id pub-id-type="doi">10.1136/jamia.2010.004366</pub-id>
<pub-id pub-id-type="pmid">20819866</pub-id>
</element-citation>
</ref>
<ref id="CR21">
<label>21.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fernández-Breis</surname>
<given-names>JT</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Leveraging electronic healthcare record standards and semantic web technologies for the identification of patient cohorts</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2013</year>
<volume>20</volume>
<fpage>e288</fpage>
<lpage>e296</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2013-001923</pub-id>
<pub-id pub-id-type="pmid">23934950</pub-id>
</element-citation>
</ref>
<ref id="CR22">
<label>22.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Richesson</surname>
<given-names>RL</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2013</year>
<volume>20</volume>
<fpage>e226</fpage>
<lpage>e231</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2013-001926</pub-id>
<pub-id pub-id-type="pmid">23956018</pub-id>
</element-citation>
</ref>
<ref id="CR23">
<label>23.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chaturvedi</surname>
<given-names>AK</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Human papillomavirus and rising oropharyngeal cancer incidence in the United States</article-title>
<source>J. Clin. Oncol.</source>
<year>2011</year>
<volume>29</volume>
<fpage>4294</fpage>
<lpage>4301</lpage>
<pub-id pub-id-type="doi">10.1200/JCO.2011.36.4596</pub-id>
<pub-id pub-id-type="pmid">21969503</pub-id>
</element-citation>
</ref>
<ref id="CR24">
<label>24.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Smith</surname>
<given-names>EM</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Age, sexual behavior and human papillomavirus infection in oral cavity and oropharyngeal cancers</article-title>
<source>Int. J. Cancer.</source>
<year>2004</year>
<volume>108</volume>
<fpage>766</fpage>
<lpage>772</lpage>
<pub-id pub-id-type="doi">10.1002/ijc.11633</pub-id>
<pub-id pub-id-type="pmid">14696105</pub-id>
</element-citation>
</ref>
<ref id="CR25">
<label>25.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gillison</surname>
<given-names>ML</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Distinct risk factor profiles for human papillomavirus type 16-positive and human papillomavirus type 16-negative head and neck cancers</article-title>
<source>J. Natl. Cancer Inst.</source>
<year>2008</year>
<volume>100</volume>
<fpage>407</fpage>
<lpage>420</lpage>
<pub-id pub-id-type="doi">10.1093/jnci/djn025</pub-id>
<pub-id pub-id-type="pmid">18334711</pub-id>
</element-citation>
</ref>
<ref id="CR26">
<label>26.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marur</surname>
<given-names>S</given-names>
</name>
<etal></etal>
</person-group>
<article-title>HPV-associated head and neck cancer: a virus-related cancer epidemic</article-title>
<source>Lancet Oncol.</source>
<year>2010</year>
<volume>11</volume>
<fpage>781</fpage>
<lpage>789</lpage>
<pub-id pub-id-type="doi">10.1016/S1470-2045(10)70017-6</pub-id>
<pub-id pub-id-type="pmid">20451455</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<label>27.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Anaya-Saavedra</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>High association of human papillomavirus infection with oral cancer: a case-control study</article-title>
<source>Arch. Med. Res.</source>
<year>2008</year>
<volume>39</volume>
<fpage>189</fpage>
<lpage>197</lpage>
<pub-id pub-id-type="doi">10.1016/j.arcmed.2007.08.003</pub-id>
<pub-id pub-id-type="pmid">18164962</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<label>28.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klussmann</surname>
<given-names>JP</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Prevalence, distribution, and viral load of human papillomavirus 16 DNA in tonsillar carcinomas</article-title>
<source>Cancer.</source>
<year>2001</year>
<volume>92</volume>
<fpage>2875</fpage>
<lpage>2884</lpage>
<pub-id pub-id-type="doi">10.1002/1097-0142(20011201)92:11&lt;2875::AID-CNCR10130&gt;3.0.CO;2-7</pub-id>
<pub-id pub-id-type="pmid">11753961</pub-id>
</element-citation>
</ref>
<ref id="CR29">
<label>29.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>D’Souza</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Case-control study of human papillomavirus and oropharyngeal cancer</article-title>
<source>N. Engl. J. Med.</source>
<year>2007</year>
<volume>356</volume>
<fpage>1944</fpage>
<lpage>1956</lpage>
<pub-id pub-id-type="doi">10.1056/NEJMoa065497</pub-id>
<pub-id pub-id-type="pmid">17494927</pub-id>
</element-citation>
</ref>
<ref id="CR30">
<label>30.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Begum</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Westra</surname>
<given-names>WH</given-names>
</name>
</person-group>
<article-title>Basaloid squamous cell carcinoma of the head and neck is a mixed variant that can be further resolved by HPV status</article-title>
<source>Am. J. Surg. Pathol.</source>
<year>2008</year>
<volume>32</volume>
<fpage>1044</fpage>
<lpage>1050</lpage>
<pub-id pub-id-type="doi">10.1097/PAS.0b013e31816380ec</pub-id>
<pub-id pub-id-type="pmid">18496144</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<label>31.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mork</surname>
<given-names>J</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Human papillomavirus infection as a risk factor for squamous-cell carcinoma of the head and neck</article-title>
<source>N. Engl. J. Med.</source>
<year>2001</year>
<volume>344</volume>
<fpage>1125</fpage>
<lpage>1131</lpage>
<pub-id pub-id-type="doi">10.1056/NEJM200104123441503</pub-id>
<pub-id pub-id-type="pmid">11297703</pub-id>
</element-citation>
</ref>
<ref id="CR32">
<label>32.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gillison</surname>
<given-names>ML</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Evidence for a causal association between human papillomavirus and a subset of head and neck cancers</article-title>
<source>J. Natl. Cancer Inst.</source>
<year>2000</year>
<volume>92</volume>
<fpage>709</fpage>
<lpage>720</lpage>
<pub-id pub-id-type="doi">10.1093/jnci/92.9.709</pub-id>
<pub-id pub-id-type="pmid">10793107</pub-id>
</element-citation>
</ref>
<ref id="CR33">
<label>33.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hafkamp</surname>
<given-names>HC</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Marked differences in survival rate between smokers and nonsmokers with HPV 16-associated tonsillar carcinomas</article-title>
<source>Int. J. Cancer.</source>
<year>2008</year>
<volume>122</volume>
<fpage>2656</fpage>
<lpage>2664</lpage>
<pub-id pub-id-type="doi">10.1002/ijc.23458</pub-id>
<pub-id pub-id-type="pmid">18360824</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<label>34.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goldenberg</surname>
<given-names>D</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Cystic lymph node metastasis in patients with head and neck cancer: An HPV-associated phenomenon</article-title>
<source>Head Neck.</source>
<year>2008</year>
<volume>30</volume>
<fpage>898</fpage>
<lpage>903</lpage>
<pub-id pub-id-type="doi">10.1002/hed.20796</pub-id>
<pub-id pub-id-type="pmid">18383529</pub-id>
</element-citation>
</ref>
<ref id="CR35">
<label>35.</label>
<mixed-citation publication-type="other">National Comprehensive Cancer Network. Head and Neck Cancer (Version 1.2016).
<ext-link ext-link-type="uri" xlink:href="https://www.nccn.org/professionals/physician_gls/pdf/head-and-neck.pdf">https://www.nccn.org/professionals/physician_gls/pdf/head-and-neck.pdf</ext-link>
(2016).</mixed-citation>
</ref>
<ref id="CR36">
<label>36.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>O’Sullivan</surname>
<given-names>B</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Development and validation of a staging system for HPV-related oropharyngeal cancer by the International Collaboration on Oropharyngeal cancer Network for Staging (ICON-S): a multicentre cohort study</article-title>
<source>Lancet Oncol.</source>
<year>2016</year>
<volume>17</volume>
<fpage>440</fpage>
<lpage>451</lpage>
<pub-id pub-id-type="doi">10.1016/S1470-2045(15)00560-4</pub-id>
<pub-id pub-id-type="pmid">26936027</pub-id>
</element-citation>
</ref>
<ref id="CR37">
<label>37.</label>
<mixed-citation publication-type="other">Porter, M. F. Snowball: A language for stemming algorithms.
<ext-link ext-link-type="uri" xlink:href="http://snowball.tartarus.org/texts/introduction.html">http://snowball.tartarus.org/texts/introduction.html</ext-link>, accessed June 2016.</mixed-citation>
</ref>
<ref id="CR38">
<label>38.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Toutanova</surname>
<given-names>K</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
<source>Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology.</source>
<year>2003</year>
<volume>1</volume>
<fpage>173</fpage>
<lpage>180</lpage>
</element-citation>
</ref>
<ref id="CR39">
<label>39.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klein</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Manning</surname>
<given-names>CD</given-names>
</name>
</person-group>
<article-title>Accurate unlexicalized parsing</article-title>
<source>Proceedings of the 41st Annual Meeting on Association for Computational Linguistics.</source>
<year>2003</year>
<volume>1</volume>
<fpage>423</fpage>
<lpage>430</lpage>
</element-citation>
</ref>
<ref id="CR40">
<label>40.</label>
<mixed-citation publication-type="other">Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology.
<italic>Nucleic Acids Res</italic>
<bold>32</bold>
(Database issue), D267–D270 (2004).</mixed-citation>
</ref>
<ref id="CR41">
<label>41.</label>
<mixed-citation publication-type="other">Savova, G. K. <italic>et al</italic>. A data-driven approach for extracting “the most specific term” for ontology development.
<italic>AMIA Annu. Symp. Proc. 2003</italic>, 579–583 (2003).</mixed-citation>
</ref>
<ref id="CR42">
<label>42.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bui</surname>
<given-names>DD</given-names>
</name>
<name>
<surname>Zeng-Treitler</surname>
<given-names>Q</given-names>
</name>
</person-group>
<article-title>Learning regular expressions for clinical text classification</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2014</year>
<volume>21</volume>
<fpage>850</fpage>
<lpage>857</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2013-002411</pub-id>
<pub-id pub-id-type="pmid">24578357</pub-id>
</element-citation>
</ref>
<ref id="CR43">
<label>43.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hall</surname>
<given-names>M</given-names>
</name>
<etal></etal>
</person-group>
<article-title>The WEKA Data Mining Software: An Update</article-title>
<source>ACM SIGKDD Explorations.</source>
<year>2009</year>
<volume>11</volume>
<fpage>10</fpage>
<lpage>18</lpage>
<pub-id pub-id-type="doi">10.1145/1656274.1656278</pub-id>
</element-citation>
</ref>
<ref id="CR44">
<label>44.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Freund</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Mason</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>The Alternating Decision Tree Algorithm</article-title>
<source>Proceedings of the 16th International Conference on Machine Learning.</source>
<year>1999</year>
<volume>99</volume>
<fpage>124</fpage>
<lpage>133</lpage>
</element-citation>
</ref>
<ref id="CR45">
<label>45.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Prasse</surname>
<given-names>P</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Learning to identify regular expressions that describe email campaigns. Proceedings of the 29th International Conference on Machine Learning</article-title>
<source>ArXiv.</source>
<year>2012</year>
<volume>1206</volume>
<fpage>4637</fpage>
</element-citation>
</ref>
<ref id="CR46">
<label>46.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Savova</surname>
<given-names>GK</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2010</year>
<volume>17</volume>
<fpage>507</fpage>
<lpage>513</lpage>
<pub-id pub-id-type="doi">10.1136/jamia.2009.001560</pub-id>
<pub-id pub-id-type="pmid">20819853</pub-id>
</element-citation>
</ref>
<ref id="CR47">
<label>47.</label>
<mixed-citation publication-type="other">Friedman, C. A broad-coverage natural language processing system.
<italic>Proceedings of AMIA Symposium 2000</italic>, 270–274 (2000).</mixed-citation>
</ref>
<ref id="CR48">
<label>48.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bland</surname>
<given-names>JM</given-names>
</name>
<name>
<surname>Altman</surname>
<given-names>DG</given-names>
</name>
</person-group>
<article-title>Multiple significance tests: the Bonferroni method</article-title>
<source>BMJ.</source>
<year>1995</year>
<volume>310</volume>
<fpage>170</fpage>
<pub-id pub-id-type="doi">10.1136/bmj.310.6973.170</pub-id>
<pub-id pub-id-type="pmid">7833759</pub-id>
</element-citation>
</ref>
<ref id="CR49">
<label>49.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Benjamini</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Hochberg</surname>
<given-names>Y</given-names>
</name>
</person-group>
<article-title>Controlling the false discovery rate: a practical and powerful approach to multiple testing</article-title>
<source>J. R. Stat. Soc. Series B.</source>
<year>1995</year>
<volume>57</volume>
<fpage>289</fpage>
<lpage>300</lpage>
</element-citation>
</ref>
<ref id="CR50">
<label>50.</label>
<mixed-citation publication-type="other">Saeys, Y., Abeel, T. &amp; Van de Peer, Y. Robust feature selection using ensemble feature selection techniques. In
<italic>Joint European Conference on Machine Learning and Knowledge Discovery in Databases</italic>
, 313–325 (Springer, 2008).</mixed-citation>
</ref>
<ref id="CR51">
<label>51.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hripcsak</surname>
<given-names>G</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Bias associated with mining electronic health records</article-title>
<source>J. Biomed. Discov. Collab.</source>
<year>2011</year>
<volume>6</volume>
<fpage>48</fpage>
<lpage>52</lpage>
<pub-id pub-id-type="doi">10.5210/disco.v6i0.3581</pub-id>
<pub-id pub-id-type="pmid">21647858</pub-id>
</element-citation>
</ref>
<ref id="CR52">
<label>52.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hripcsak</surname>
<given-names>G</given-names>
</name>
<name>
<surname>Albers</surname>
<given-names>DJ</given-names>
</name>
</person-group>
<article-title>Next-generation phenotyping of electronic health records</article-title>
<source>J. Am. Med. Inform. Assoc.</source>
<year>2013</year>
<volume>20</volume>
<fpage>117</fpage>
<lpage>121</lpage>
<pub-id pub-id-type="doi">10.1136/amiajnl-2012-001145</pub-id>
<pub-id pub-id-type="pmid">22955496</pub-id>
</element-citation>
</ref>
<ref id="CR53">
<label>53.</label>
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hersh</surname>
<given-names>WR</given-names>
</name>
<etal></etal>
</person-group>
<article-title>Caveats for the use of operational electronic health record data in comparative effectiveness research</article-title>
<source>Med. Care.</source>
<year>2013</year>
<volume>51</volume>
<issue>8 Suppl 3</issue>
<fpage>S30</fpage>
<lpage>S37</lpage>
<pub-id pub-id-type="doi">10.1097/MLR.0b013e31829b1dbd</pub-id>
<pub-id pub-id-type="pmid">23774517</pub-id>
</element-citation>
</ref>
<ref id="CR54">
<label>54.</label>
<mixed-citation publication-type="other">Zou, H. <italic>et al</italic>. Regularization and variable selection via the elastic net.
<italic>J. R. Stat. Soc. B</italic>, 301–320 (2005).</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Santé/explor/EdenteV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D66 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000D66 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Santé
   |area=    EdenteV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     PMC:5537364
   |texte=   TEPAPA: a novel in silico feature learning pipeline for mining prognostic and associative factors from text-based electronic medical records
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/RBID.i   -Sk "pubmed:28761061" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd   \
       | NlmPubMed2Wicri -a EdenteV2 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Thu Nov 30 15:26:48 2017. Site generation: Tue Mar 8 16:36:20 2022