OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.

Internal identifier: 000021 (PubMed/Curation); previous: 000020; next: 000022

Authors: Cyril Grouin [France]; Pierre Zweigenbaum

Source: Studies in health technology and informatics (ISSN 0926-9630), 2013

RBID : pubmed:23920600

French descriptors

English descriptors

Abstract

In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.
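The scores quoted in the abstract are exact-match F-measures. As a minimal sketch, assuming the usual definition (harmonic mean of precision and recall, where a predicted identifier counts as correct only if its boundaries and category both match the reference), the computation can be illustrated as follows. The span layout, the function name `exact_match_scores`, and the toy annotations are assumptions made for illustration; they are not the authors' evaluation code or data.

```python
from typing import Set, Tuple

# An annotation is (start offset, end offset, category). This layout is an
# assumption made for illustration, not the authors' data format.
Span = Tuple[int, int, str]


def exact_match_scores(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) under exact matching."""
    # Only spans identical in boundaries and category count as correct.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy annotations, not taken from the paper's corpora.
gold = {(0, 5, "last_name"), (10, 20, "date"), (30, 35, "zip_code"), (40, 52, "hospital")}
predicted = {(0, 5, "last_name"), (10, 20, "date"), (30, 36, "zip_code")}

precision, recall, f_measure = exact_match_scores(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

In this toy case the zip code prediction is off by one character, so under exact matching it counts as both a false positive and a missed reference span. This strictness is one plausible reason why OCR noise can lower exact-match scores.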

PubMed: 23920600

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author>
<name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1">
<nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23920600</idno>
<idno type="pmid">23920600</idno>
<idno type="wicri:Area/PubMed/Corpus">000021</idno>
<idno type="wicri:Area/PubMed/Curation">000021</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</title>
<author>
<name sortKey="Grouin, Cyril" sort="Grouin, Cyril" uniqKey="Grouin C" first="Cyril" last="Grouin">Cyril Grouin</name>
<affiliation wicri:level="1">
<nlm:affiliation>LIMSI-CNRS, Orsay, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, Orsay</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zweigenbaum, Pierre" sort="Zweigenbaum, Pierre" uniqKey="Zweigenbaum P" first="Pierre" last="Zweigenbaum">Pierre Zweigenbaum</name>
</author>
</analytic>
<series>
<title level="j">Studies in health technology and informatics</title>
<idno type="ISSN">0926-9630</idno>
<imprint>
<date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Data Mining (methods)</term>
<term>Electronic Health Records</term>
<term>France</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="MESH" type="geographic" xml:lang="en">
<term>France</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Data Mining</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Artificial Intelligence</term>
<term>Computer Security</term>
<term>Confidentiality</term>
<term>Electronic Health Records</term>
<term>Health Records, Personal</term>
<term>Natural Language Processing</term>
<term>Vocabulary, Controlled</term>
</keywords>
<keywords scheme="Wicri" type="geographic" xml:lang="fr">
<term>France</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.</div>
</front>
</TEI>
<pubmed>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">23920600</PMID>
<DateCreated>
<Year>2013</Year>
<Month>08</Month>
<Day>07</Day>
</DateCreated>
<DateCompleted>
<Year>2015</Year>
<Month>04</Month>
<Day>06</Day>
</DateCompleted>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0926-9630</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>192</Volume>
<PubDate>
<Year>2013</Year>
</PubDate>
</JournalIssue>
<Title>Studies in health technology and informatics</Title>
<ISOAbbreviation>Stud Health Technol Inform</ISOAbbreviation>
</Journal>
<ArticleTitle>Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.</ArticleTitle>
<Pagination>
<MedlinePgn>476-80</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>In this paper, we present a comparison of two approaches to automatically de-identify medical records written in French: a rule-based system and a machine-learning based system using a conditional random fields (CRF) formalism. Both systems have been designed to process nine identifiers in a corpus of medical records in cardiology. We performed two evaluations: first, on 62 documents in cardiology, and on 10 documents in foetopathology - produced by optical character recognition (OCR) - to evaluate the robustness of our systems. We achieved a 0.843 (rule-based) and 0.883 (machine-learning) exact match overall F-measure in cardiology. While the rule-based system allowed us to achieve good results on nominative (first and last names) and numerical data (dates, phone numbers, and zip codes), the machine-learning approach performed best on more complex categories (postal addresses, hospital names, medical devices, and towns). On the foetopathology corpus, although our systems have not been designed for this corpus and despite OCR character recognition errors, we obtained promising results: a 0.681 (rule-based) and 0.638 (machine-learning) exact-match overall F-measure. This demonstrates that existing tools can be applied to process new documents of lower quality.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Grouin</LastName>
<ForeName>Cyril</ForeName>
<Initials>C</Initials>
<AffiliationInfo>
<Affiliation>LIMSI-CNRS, Orsay, France.</Affiliation>
</AffiliationInfo>
</Author>
<Author ValidYN="Y">
<LastName>Zweigenbaum</LastName>
<ForeName>Pierre</ForeName>
<Initials>P</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>Netherlands</Country>
<MedlineTA>Stud Health Technol Inform</MedlineTA>
<NlmUniqueID>9214582</NlmUniqueID>
<ISSNLinking>0926-9630</ISSNLinking>
</MedlineJournalInfo>
<CitationSubset>T</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D001185">Artificial Intelligence</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D016494">Computer Security</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D003219">Confidentiality</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D057225">Data Mining</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000379">methods</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D057286">Electronic Health Records</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" Type="Geographic" UI="D005602">France</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D055991">Health Records, Personal</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D009323">Natural Language Processing</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D018875">Vocabulary, Controlled</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="entrez">
<Year>2013</Year>
<Month>8</Month>
<Day>8</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="pubmed">
<Year>2013</Year>
<Month>8</Month>
<Day>8</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2015</Year>
<Month>4</Month>
<Day>7</Day>
<Hour>6</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">23920600</ArticleId>
</ArticleIdList>
</PubmedData>
</pubmed>
</record>
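
As a hedged illustration of how the record above might be read programmatically, the following sketch uses Python's standard `xml.etree.ElementTree` on an abridged copy of the `<pubmed>` element. Only that element is parsed: the `<TEI>` part as displayed uses the `wicri:` and `nlm:` prefixes without namespace declarations, so a namespace-aware parser would be expected to reject the full record unless those declarations are supplied. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the <pubmed> element from the record above.
PUBMED_XML = """
<pubmed>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">23920600</PMID>
<Article PubModel="Print">
<Journal>
<JournalIssue CitedMedium="Internet">
<Volume>192</Volume>
<PubDate>
<Year>2013</Year>
</PubDate>
</JournalIssue>
<Title>Studies in health technology and informatics</Title>
</Journal>
<Pagination>
<MedlinePgn>476-80</MedlinePgn>
</Pagination>
</Article>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D001185">Artificial Intelligence</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D057225">Data Mining</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000379">methods</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
</pubmed>
"""

root = ET.fromstring(PUBMED_XML.strip())
citation = root.find("MedlineCitation")

# Simple bibliographic fields, addressed by their path below MedlineCitation.
pmid = citation.findtext("PMID")
journal = citation.findtext("Article/Journal/Title")
volume = citation.findtext("Article/Journal/JournalIssue/Volume")
pages = citation.findtext("Article/Pagination/MedlinePgn")

# MeSH headings: append the qualifier in parentheses when one is present.
mesh_terms = []
for heading in citation.iterfind("MeshHeadingList/MeshHeading"):
    descriptor = heading.findtext("DescriptorName")
    qualifier = heading.findtext("QualifierName")
    mesh_terms.append(f"{descriptor} ({qualifier})" if qualifier else descriptor)

print(pmid, journal, volume, pages)
print(mesh_terms)
```

Joining descriptor and qualifier this way yields the same form as the `KwdEn` keyword list in the TEI header, for example `Data Mining (methods)`.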

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PubMed/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000021 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd -nk 000021 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PubMed
   |étape=   Curation
   |type=    RBID
   |clé=     pubmed:23920600
   |texte=   Automatic de-identification of French clinical records: comparison of rule-based and machine-learning approaches.
}}
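
If many such links need to be produced, the template call above could be generated rather than typed by hand. The sketch below is only an illustration: `render_explor_link` is a hypothetical helper, not part of Dilib or Wicri. It reproduces the column alignment shown above, although the wiki template may well accept other spacing.

```python
def render_explor_link(fields: dict) -> str:
    """Render an {{Explor lien}} call from parameter names mapped to values."""
    lines = ["{{Explor lien"]
    for name, value in fields.items():
        # Pad "name=" to nine characters so the values line up as in the page.
        lines.append(f"   |{(name + '=').ljust(9)}{value}")
    lines.append("}}")
    return "\n".join(lines)


# Parameter names and values copied from the template call shown above.
link = render_explor_link({
    "wiki": "Ticri/CIDE",
    "area": "OcrV1",
    "flux": "PubMed",
    "étape": "Curation",
    "type": "RBID",
    "clé": "pubmed:23920600",
    "texte": "Automatic de-identification of French clinical records: "
             "comparison of rule-based and machine-learning approaches.",
})
print(link)
```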

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/PubMed/Curation/RBID.i   -Sk "pubmed:23920600" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/PubMed/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024