Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.
Internal identifier: 000169 (Ncbi/Merge); previous: 000168; next: 000170
Authors: Cyril Grouin [France]; Pierre Zweigenbaum
Source:
- Studies in health technology and informatics [0926-9630]; 2013.
French descriptors
- Wicri :
- geographic : France.
English descriptors
- KwdEn :
- MESH :
Abstract
In this paper, we compare two approaches to automatically de-identifying medical records written in French: a rule-based system and a machine-learning system based on conditional random fields (CRF). Both systems were designed to process nine identifiers in a corpus of cardiology medical records. We performed two evaluations: the first on 62 cardiology documents, and the second on 10 foetopathology documents produced by optical character recognition (OCR), to evaluate the robustness of our systems. We achieved an exact-match overall F-measure of 0.843 (rule-based) and 0.883 (machine-learning) in cardiology. While the rule-based system achieved good results on nominative data (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems were not designed for this corpus and despite OCR errors, we obtained promising results: an exact-match overall F-measure of 0.681 (rule-based) and 0.638 (machine-learning). This demonstrates that existing tools can be applied to process new documents of lower quality.
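The exact-match F-measure reported in the abstract can be illustrated with a small sketch. The span tuples, category labels, and counts below are invented for illustration and are not taken from the paper; the sketch only assumes the usual definition, in which a predicted annotation counts as correct when its boundaries and category both match a gold annotation exactly.

```python
from typing import Set, Tuple

# An annotation is (start_offset, end_offset, category).
# All values below are hypothetical, not data from the paper.
Span = Tuple[int, int, str]


def exact_match_f_measure(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact matching.

    A prediction is a true positive only if the same (start, end, category)
    triple appears in the gold set.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(0, 5, "LAST_NAME"), (10, 20, "DATE"), (30, 35, "ZIP"), (40, 52, "HOSPITAL")}
# The ZIP prediction is one character too long, so it does not count as a match.
predicted = {(0, 5, "LAST_NAME"), (10, 20, "DATE"), (30, 36, "ZIP")}

p, r, f = exact_match_f_measure(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.667 R=0.500 F=0.571
```

Under this strict criterion a boundary error of a single character costs both a false positive and a false negative, which is one reason exact-match scores tend to drop on noisy OCR output.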
PubMed: 23920600
Links toward previous steps (curation, corpus...)
- to stream PubMed, to step Corpus: 000021
- to stream PubMed, to step Curation: 000021
- to stream PubMed, to step Checkpoint: 000021
Links to Exploration step
pubmed:23920600
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author><name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1"><nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23920600</idno>
<idno type="pmid">23920600</idno>
<idno type="wicri:Area/PubMed/Corpus">000021</idno>
<idno type="wicri:Area/PubMed/Curation">000021</idno>
<idno type="wicri:Area/PubMed/Checkpoint">000021</idno>
<idno type="wicri:Area/Ncbi/Merge">000169</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author><name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1"><nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
<placeName><region type="région" nuts="2">Île-de-France</region>
<settlement type="city">Orsay</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</analytic>
<series><title level="j">Studies in health technology and informatics</title>
<idno type="ISSN">0926-9630</idno>
<imprint><date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Data Mining (methods)</term>
<term>Electronic Health Records</term>
<term>France</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="MESH" type="geographic" xml:lang="en"><term>France</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en"><term>Data Mining</term>
</keywords>
<keywords scheme="MESH" xml:lang="en"><term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Electronic Health Records</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="Wicri" type="geographic" xml:lang="fr"><term>France</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.</div>
</front>
</TEI>
<pubmed><MedlineCitation Owner="NLM" Status="MEDLINE"><PMID Version="1">23920600</PMID>
<DateCreated><Year>2013</Year>
<Month>08</Month>
<Day>07</Day>
</DateCreated>
<DateCompleted><Year>2015</Year>
<Month>04</Month>
<Day>06</Day>
</DateCompleted>
<Article PubModel="Print"><Journal><ISSN IssnType="Print">0926-9630</ISSN>
<JournalIssue CitedMedium="Internet"><Volume>192</Volume>
<PubDate><Year>2013</Year>
</PubDate>
</JournalIssue>
<Title>Studies in health technology and informatics</Title>
<ISOAbbreviation>Stud Health Technol Inform</ISOAbbreviation>
</Journal>
<ArticleTitle>Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</ArticleTitle>
<Pagination><MedlinePgn>476-80</MedlinePgn>
</Pagination>
<Abstract><AbstractText>In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y"><Author ValidYN="Y"><LastName>Grouin</LastName>
<ForeName>Cyril</ForeName>
<Initials>C</Initials>
<AffiliationInfo><Affiliation>LIMSI-CNRS, Orsay, France.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y"><LastName>Zweigenbaum</LastName>
<ForeName>Pierre</ForeName>
<Initials>P</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList><PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo><Country>Netherlands</Country>
<MedlineTA>Stud Health Technol Inform</MedlineTA>
<NlmUniqueID>9214582</NlmUniqueID>
<ISSNLinking>0926-9630</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>T</CitationSubset>
<MeshHeadingList><MeshHeading><DescriptorName MajorTopicYN="Y" UI="D001185">Artificial Intelligence</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="Y" UI="D016494">Computer Security</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="Y" UI="D003219">Confidentiality</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N" UI="D057225">Data Mining</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000379">methods</QualifierName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="Y" UI="D057286">Electronic Health Records</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N" Type="Geographic" UI="D005602">France</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="Y" UI="D055991">Health Records, Personal</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N" UI="D009323">Natural Language Processing</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="Y" UI="D018875">Vocabulary, Controlled</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData><History><PubMedPubDate PubStatus="entrez"><Year>2013</Year>
<Month>8</Month>
<Day>8</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed"><Year>2013</Year>
<Month>8</Month>
<Day>8</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline"><Year>2015</Year>
<Month>4</Month>
<Day>7</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList><ArticleId IdType="pubmed">23920600</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
<affiliations><list><country><li>France</li>
</country>
<region><li>Île-de-France</li>
</region>
<settlement><li>Orsay</li>
</settlement>
</list>
<tree><noCountry><name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</noCountry>
<country name="France"><region name="Île-de-France"><name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
</region>
</country>
</tree>
</affiliations>
</record>
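As a minimal sketch of how the PubMed part of this record could be read programmatically, the following uses Python's standard `xml.etree.ElementTree` on a shortened excerpt of the `<pubmed>` element. The function name `mesh_terms` is hypothetical. The excerpt is parsed on its own because the TEI part of the record uses the `wicri:` and `nlm:` prefixes without any namespace declaration in the text shown here, which a namespace-aware parser would likely reject.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt copied from the <pubmed> part of the record.
FRAGMENT = """
<pubmed><MedlineCitation Owner="NLM" Status="MEDLINE"><PMID Version="1">23920600</PMID>
<MeshHeadingList>
<MeshHeading><DescriptorName MajorTopicYN="Y" UI="D001185">Artificial Intelligence</DescriptorName>
</MeshHeading>
<MeshHeading><DescriptorName MajorTopicYN="N" UI="D057225">Data Mining</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000379">methods</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</pubmed>
"""


def mesh_terms(xml_text):
    """Return (pmid, terms); a qualifier, if present, is appended in parentheses."""
    root = ET.fromstring(xml_text.strip())
    pmid = root.findtext("./MedlineCitation/PMID")
    terms = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.findtext("DescriptorName")
        qualifier = heading.findtext("QualifierName")
        terms.append(f"{descriptor} ({qualifier})" if qualifier else descriptor)
    return pmid, terms


pmid, terms = mesh_terms(FRAGMENT)
print(pmid, terms)
```

The "descriptor (qualifier)" form mirrors the way the KwdEn keywords in the record write "Data Mining (methods)".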
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Ncbi/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000169 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd -nk 000169 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Ncbi |étape= Merge |type= RBID |clé= pubmed:23920600 |texte= Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches. }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Ncbi/Merge/RBID.i -Sk "pubmed:23920600" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Ncbi/Merge/biblio.hfd \
       | NlmPubMed2Wicri -a OcrV1
This area was generated with Dilib version V0.6.32.