Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.
Internal identifier: 000172 (Main/Curation); previous: 000171; next: 000173
Authors: Cyril Grouin [France]; Pierre Zweigenbaum
Source:
- Studies in health technology and informatics [0926-9630]; 2013.
French descriptors
- Wicri:
- geographic: France.
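English descriptors
- KwdEn: Artificial Intelligence; Computer Security; Confidentiality; Data Mining (methods); Electronic Health Records; France; Health Records, Personal; Natural Language Processing; Vocabulary, Controlled.
- MESH:
- geographic: France.
- methods: Data Mining.
- Artificial Intelligence; Computer Security; Confidentiality; Electronic Health Records; Health Records, Personal; Natural Language Processing; Vocabulary, Controlled.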
Abstract
In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.
PubMed: 23920600
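The abstract reports exact-match F-measures for a rule-based system. As a minimal sketch of what that evaluation involves, the Python below tags two of the numerical categories (phone numbers and zip codes) with regular expressions and scores the output against a gold standard, counting a predicted span as correct only when its offsets and category both match exactly. The patterns, the category labels, and the function names are assumptions made for illustration; they are not the rules or the evaluation code used by the authors.

```python
import re
from typing import Set, Tuple

# A span is (start offset, end offset, category label).
Span = Tuple[int, int, str]

# Hypothetical patterns for illustration only; NOT the rules from the paper.
RULES = {
    # French phone number: ten digits starting with 0, in pairs,
    # optionally separated by a space or a dot.
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
    # French zip code: exactly five digits.
    "ZIP": re.compile(r"\b\d{5}\b"),
}


def tag(text: str) -> Set[Span]:
    """Return every span matched by one of the illustrative rules."""
    spans = set()
    for category, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.add((match.start(), match.end(), category))
    return spans


def exact_match_f_measure(gold: Set[Span], predicted: Set[Span]) -> float:
    """F-measure where a prediction counts only if offsets and category
    are identical to a gold span (no credit for partial overlap)."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under exact matching, a prediction whose boundary is off by a single character counts as both a false positive and a false negative, which is one reason OCR noise of the kind mentioned for the foetopathology corpus can lower the score.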
Links toward previous steps (curation, corpus...)
- to stream PubMed, to step Corpus: 000021
- to stream PubMed, to step Curation: 000021
- to stream PubMed, to step Checkpoint: 000021
- to stream Ncbi, to step Merge: 000169
- to stream Ncbi, to step Curation: 000169
- to stream Ncbi, to step Checkpoint: 000169
- to stream Main, to step Merge: 000175
Links to Exploration step
pubmed:23920600
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author><name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1"><nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23920600</idno>
<idno type="pmid">23920600</idno>
<idno type="wicri:Area/PubMed/Corpus">000021</idno>
<idno type="wicri:Area/PubMed/Curation">000021</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000021</idno>
<idno type="wicri:Area/Ncbi/Merge">000169</idno>
<idno type="wicri:Area/Ncbi/Curation">000169</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000169</idno>
<idno type="wicri:doubleKey">0926-9630:2013:Grouin C:automatic:de:identification</idno>
<idno type="wicri:Area/Main/Merge">000175</idno>
<idno type="wicri:Area/Main/Curation">000172</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author><name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1"><nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</analytic>
<series><title level="j">Studies in health technology and informatics</title>
<idno type="ISSN">0926-9630</idno>
<imprint><date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Data Mining (methods)</term>
<term>Electronic Health Records</term>
<term>France</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="MESH" type="geographic" xml:lang="en"><term>France</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Data Mining</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Electronic Health Records</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="Wicri" type="geographic" xml:lang="fr"><term>France</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000172 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Curation/biblio.hfd -nk 000172 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Curation |type= RBID |clé= pubmed:23920600 |texte= Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches. }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Curation/RBID.i -Sk "pubmed:23920600" \
  | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Curation/biblio.hfd \
  | NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.