Exploration server for relations between France and Australia


Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text

Internal identifier: 004900 (Main/Merge); previous: 004899; next: 004901

Authors: Antonio Jimeno Yepes [United States, Australia]; Élise Prieur-Gaston [France]; Aurélie Névéol [United States, France]

Source: BMC Bioinformatics, 2013

RBID : PMC:3651320

Abstract

Background

Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain.

Results

We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text.

Conclusions

We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.
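As a rough illustration of the title half of such a corpus, the sketch below pairs English and original-language titles from a PubMed-style XML export. It assumes the standard `ArticleTitle`, `VernacularTitle` and `Language` elements of MEDLINE/PubMed XML; the function name `title_pairs` and the sample record are hypothetical and are not taken from the paper, whose actual pipeline may differ.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a PubMed XML export; the PMIDs and titles
# are illustrative only and do not come from the paper.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>[Treatment of asthma in children].</ArticleTitle>
        <Language>fre</Language>
        <VernacularTitle>Traitement de l'asthme chez l'enfant.</VernacularTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>Asthma control in adults.</ArticleTitle>
        <Language>eng</Language>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def title_pairs(pubmed_xml, language):
    """Return (english_title, vernacular_title) pairs for one language code."""
    root = ET.fromstring(pubmed_xml)
    pairs = []
    for article in root.iter("Article"):
        if article.findtext("Language") != language:
            continue
        english = (article.findtext("ArticleTitle") or "").strip()
        vernacular = (article.findtext("VernacularTitle") or "").strip()
        if not english or not vernacular:
            continue  # no usable pair without both sides
        # Translated English titles are wrapped in square brackets in PubMed;
        # drop the brackets and the trailing period on both sides.
        english = english.rstrip(".").strip("[]")
        pairs.append((english, vernacular.rstrip(".")))
    return pairs


if __name__ == "__main__":
    for en, fr in title_pairs(SAMPLE, "fre"):
        print(en, "|||", fr)
```

The abstract half of the corpus, retrieved from journal websites, would need publisher-specific handling and is not sketched here.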


Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3651320
DOI: 10.1186/1471-2105-14-146
PubMed: 23631733
PubMed Central: 3651320

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text</title>
<author>
<name sortKey="Jimeno Yepes, Antonio" sort="Jimeno Yepes, Antonio" uniqKey="Jimeno Yepes A" first="Antonio" last="Jimeno Yepes">Antonio Jimeno Yepes</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="I2">NICTA Victoria Research Lab, Melbourne, VIC, 3010, Australia</nlm:aff>
<country xml:lang="fr">Australie</country>
<wicri:regionArea>NICTA Victoria Research Lab, Melbourne, VIC, 3010</wicri:regionArea>
<wicri:noRegion>3010</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Prieur Gaston, Elise" sort="Prieur Gaston, Elise" uniqKey="Prieur Gaston E" first="Élise" last="Prieur-Gaston">Élise Prieur-Gaston</name>
<affiliation wicri:level="4">
<nlm:aff id="I3">Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821</wicri:regionArea>
<wicri:noRegion>76821</wicri:noRegion>
<orgName type="university">Université de Rouen</orgName>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Neveol, Aurelie" sort="Neveol, Aurelie" uniqKey="Neveol A" first="Aurélie" last="Névéol">Aurélie Névéol</name>
<affiliation wicri:level="1">
<nlm:aff id="I4">National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="I5">LIMSI-CNRS, rue John von Neumann, Orsay, F-91400, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, rue John von Neumann, Orsay, F-91400</wicri:regionArea>
<wicri:noRegion>91400</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23631733</idno>
<idno type="pmc">3651320</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3651320</idno>
<idno type="RBID">PMC:3651320</idno>
<idno type="doi">10.1186/1471-2105-14-146</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">001571</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001571</idno>
<idno type="wicri:Area/Pmc/Curation">001431</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">001431</idno>
<idno type="wicri:Area/Pmc/Checkpoint">001C03</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">001C03</idno>
<idno type="wicri:Area/Ncbi/Merge">001243</idno>
<idno type="wicri:Area/Ncbi/Curation">001243</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">001243</idno>
<idno type="wicri:Area/Main/Merge">004900</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text</title>
<author>
<name sortKey="Jimeno Yepes, Antonio" sort="Jimeno Yepes, Antonio" uniqKey="Jimeno Yepes A" first="Antonio" last="Jimeno Yepes">Antonio Jimeno Yepes</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="I2">NICTA Victoria Research Lab, Melbourne, VIC, 3010, Australia</nlm:aff>
<country xml:lang="fr">Australie</country>
<wicri:regionArea>NICTA Victoria Research Lab, Melbourne, VIC, 3010</wicri:regionArea>
<wicri:noRegion>3010</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Prieur Gaston, Elise" sort="Prieur Gaston, Elise" uniqKey="Prieur Gaston E" first="Élise" last="Prieur-Gaston">Élise Prieur-Gaston</name>
<affiliation wicri:level="4">
<nlm:aff id="I3">Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821</wicri:regionArea>
<wicri:noRegion>76821</wicri:noRegion>
<orgName type="university">Université de Rouen</orgName>
<placeName>
<settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Neveol, Aurelie" sort="Neveol, Aurelie" uniqKey="Neveol A" first="Aurélie" last="Névéol">Aurélie Névéol</name>
<affiliation wicri:level="1">
<nlm:aff id="I4">National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="I5">LIMSI-CNRS, rue John von Neumann, Orsay, F-91400, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, rue John von Neumann, Orsay, F-91400</wicri:regionArea>
<wicri:noRegion>91400</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<sec>
<title>Background</title>
<p>Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain.</p>
</sec>
<sec>
<title>Results</title>
<p>We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text.</p>
</sec>
<sec>
<title>Conclusions</title>
<p>We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.</p>
</sec>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Postman Caucheteux, Wa" uniqKey="Postman Caucheteux W">WA Postman-Caucheteux</name>
</author>
<author>
<name sortKey="Neveol, A" uniqKey="Neveol A">A Névéol</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Garcia Castillo, D" uniqKey="Garcia Castillo D">D Garcia-Castillo</name>
</author>
<author>
<name sortKey="Fetters, Md" uniqKey="Fetters M">MD Fetters</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Mark, K" uniqKey="Mark K">K Markó</name>
</author>
<author>
<name sortKey="Schulz, S" uniqKey="Schulz S">S Schulz</name>
</author>
<author>
<name sortKey="Hahn, U" uniqKey="Hahn U">U Hahn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Neveol, A" uniqKey="Neveol A">A Névéol</name>
</author>
<author>
<name sortKey="Pereira, S" uniqKey="Pereira S">S Pereira</name>
</author>
<author>
<name sortKey="Soualmia, Lf" uniqKey="Soualmia L">LF Soualmia</name>
</author>
<author>
<name sortKey="Thirion, B" uniqKey="Thirion B">B Thirion</name>
</author>
<author>
<name sortKey="Darmoni, Sj" uniqKey="Darmoni S">SJ Darmoni</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, F" uniqKey="Liu F">F Liu</name>
</author>
<author>
<name sortKey="Ackerman, M" uniqKey="Ackerman M">M Ackerman</name>
</author>
<author>
<name sortKey="Fontelo, P" uniqKey="Fontelo P">P Fontelo</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Neveol, A" uniqKey="Neveol A">A Névéol</name>
</author>
<author>
<name sortKey="Ozdowska, S" uniqKey="Ozdowska S">S Ozdowska</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Ozdowska, S" uniqKey="Ozdowska S">S Ozdowska</name>
</author>
<author>
<name sortKey="Neveol, A" uniqKey="Neveol A">A Névéol</name>
</author>
<author>
<name sortKey="Thirion, B" uniqKey="Thirion B">B Thirion</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Langlais, P" uniqKey="Langlais P">P Langlais</name>
</author>
<author>
<name sortKey="Carl, M" uniqKey="Carl M">M Carl</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zeng Treitler, Q" uniqKey="Zeng Treitler Q">Q Zeng-Treitler</name>
</author>
<author>
<name sortKey="Kim, H" uniqKey="Kim H">H Kim</name>
</author>
<author>
<name sortKey="Rosemblat, G" uniqKey="Rosemblat G">G Rosemblat</name>
</author>
<author>
<name sortKey="Keselman, A" uniqKey="Keselman A">A Keselman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kirchhoff, K" uniqKey="Kirchhoff K">K Kirchhoff</name>
</author>
<author>
<name sortKey="Turner, Am" uniqKey="Turner A">AM Turner</name>
</author>
<author>
<name sortKey="Axelrod, A" uniqKey="Axelrod A">A Axelrod</name>
</author>
<author>
<name sortKey="Saavedra, F" uniqKey="Saavedra F">F Saavedra</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wu, C" uniqKey="Wu C">C Wu</name>
</author>
<author>
<name sortKey="Xia, F" uniqKey="Xia F">F Xia</name>
</author>
<author>
<name sortKey="Deleger, L" uniqKey="Deleger L">L Deleger</name>
</author>
<author>
<name sortKey="Solti, I" uniqKey="Solti I">I Solti</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Levenshtein, Vi" uniqKey="Levenshtein V">VI Levenshtein</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koehn, P" uniqKey="Koehn P">P Koehn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Papineni, K" uniqKey="Papineni K">K Papineni</name>
</author>
<author>
<name sortKey="Roukos, S" uniqKey="Roukos S">S Roukos</name>
</author>
<author>
<name sortKey="Ward, T" uniqKey="Ward T">T Ward</name>
</author>
<author>
<name sortKey="Zhu, Wj" uniqKey="Zhu W">WJ Zhu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Callison Burch, C" uniqKey="Callison Burch C">C Callison-Burch</name>
</author>
<author>
<name sortKey="Koehn, P" uniqKey="Koehn P">P Koehn</name>
</author>
<author>
<name sortKey="Monz, C" uniqKey="Monz C">C Monz</name>
</author>
<author>
<name sortKey="Zaidan, Of" uniqKey="Zaidan O">OF Zaidan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Y" uniqKey="Zhang Y">Y Zhang</name>
</author>
<author>
<name sortKey="Ke, W" uniqKey="Ke W">W Ke</name>
</author>
<author>
<name sortKey="Gao, J" uniqKey="Gao J">J Gao</name>
</author>
<author>
<name sortKey="Vine, P" uniqKey="Vine P">P Vine</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Huck, M" uniqKey="Huck M">M Huck</name>
</author>
<author>
<name sortKey="Vilar, D" uniqKey="Vilar D">D Vilar</name>
</author>
<author>
<name sortKey="Stein, D" uniqKey="Stein D">D Stein</name>
</author>
<author>
<name sortKey="Ney, H" uniqKey="Ney H">H Ney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koehn, P" uniqKey="Koehn P">P Koehn</name>
</author>
<author>
<name sortKey="Monz, C" uniqKey="Monz C">C Monz</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Haddow, B" uniqKey="Haddow B">B Haddow</name>
</author>
<author>
<name sortKey="Koehn, P" uniqKey="Koehn P">P Koehn</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Popovic, M" uniqKey="Popovic M">M Popovic</name>
</author>
<author>
<name sortKey="Vilar, D" uniqKey="Vilar D">D Vilar</name>
</author>
<author>
<name sortKey="Ney, H" uniqKey="Ney H">H Ney</name>
</author>
<author>
<name sortKey="Jovicic, S" uniqKey="Jovicic S">S Jovicic</name>
</author>
<author>
<name sortKey="Saric, Z" uniqKey="Saric Z">Z Saric</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bodenreider, O" uniqKey="Bodenreider O">O Bodenreider</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koehn, P" uniqKey="Koehn P">P Koehn</name>
</author>
<author>
<name sortKey="Hoang, H" uniqKey="Hoang H">H Hoang</name>
</author>
<author>
<name sortKey="Birch, A" uniqKey="Birch A">A Birch</name>
</author>
<author>
<name sortKey="Callison Burch, C" uniqKey="Callison Burch C">C Callison-Burch</name>
</author>
<author>
<name sortKey="Federico, M" uniqKey="Federico M">M Federico</name>
</author>
<author>
<name sortKey="Bertoldi, N" uniqKey="Bertoldi N">N Bertoldi</name>
</author>
<author>
<name sortKey="Cowan, B" uniqKey="Cowan B">B Cowan</name>
</author>
<author>
<name sortKey="Shen, W" uniqKey="Shen W">W Shen</name>
</author>
<author>
<name sortKey="Moran, C" uniqKey="Moran C">C Moran</name>
</author>
<author>
<name sortKey="Zens, R" uniqKey="Zens R">R Zens</name>
</author>
<author>
<name sortKey="Dyer, C" uniqKey="Dyer C">C Dyer</name>
</author>
<author>
<name sortKey="Bojar, O" uniqKey="Bojar O">O Bojar</name>
</author>
<author>
<name sortKey="Constantin, A" uniqKey="Constantin A">A Constantin</name>
</author>
<author>
<name sortKey="Herbst, E" uniqKey="Herbst E">E Herbst</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Stolcke, A" uniqKey="Stolcke A">A Stolcke</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Och, Fj" uniqKey="Och F">FJ Och</name>
</author>
<author>
<name sortKey="Ney, H" uniqKey="Ney H">H Ney</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
</record>
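
The record above can be read with a standard XML parser once its undeclared `wicri:` and `nlm:` prefixes are bound to some namespace. The sketch below is one minimal way to list each author with the countries of their affiliations. The namespace URIs and the function name `authors_with_countries` are placeholders chosen for illustration, not values defined by Wicri or Dilib, and the sample is an abbreviated copy of the record's `titleStmt`.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): the record uses the wicri: and
# nlm: prefixes without declaring them, so a wrapper element declares them
# only so that the parser accepts the fragment.
NS_DECL = 'xmlns:wicri="urn:example:wicri" xmlns:nlm="urn:example:nlm"'

# Abbreviated copy of the titleStmt shown in the record.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<author>
<name first="Antonio" last="Jimeno Yepes">Antonio Jimeno Yepes</name>
<affiliation wicri:level="1">
<nlm:aff id="I1">U.S. National Library of Medicine, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="I2">NICTA Victoria Research Lab, Melbourne, Australia</nlm:aff>
<country xml:lang="fr">Australie</country>
</affiliation>
</author>
<author>
<name first="Élise" last="Prieur-Gaston">Élise Prieur-Gaston</name>
<affiliation wicri:level="4">
<nlm:aff id="I3">Université de Rouen, France</nlm:aff>
<country xml:lang="fr">France</country>
</affiliation>
</author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""


def authors_with_countries(record_xml):
    """Return [(author name, [country, ...]), ...] from the titleStmt."""
    # Wrap the fragment so the wicri: and nlm: prefixes are bound.
    root = ET.fromstring(f"<wrapper {NS_DECL}>{record_xml}</wrapper>")
    title_stmt = root.find(".//titleStmt")
    if title_stmt is None:
        return []
    result = []
    for author in title_stmt.findall("author"):
        name = (author.findtext("name") or "").strip()
        countries = [(c.text or "").strip()
                     for c in author.findall("affiliation/country")]
        result.append((name, countries))
    return result


if __name__ == "__main__":
    for name, countries in authors_with_countries(SAMPLE):
        print(name, "->", ", ".join(countries))
```

Restricting the search to `titleStmt` avoids counting the same authors a second time from the `sourceDesc` block, which repeats them.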

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Asie/explor/AustralieFrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 004900 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 004900 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Asie
   |area=    AustralieFrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     PMC:3651320
   |texte=   Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Merge/RBID.i   -Sk "PMC:3651320" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Merge/biblio.hfd   \
       | NlmPubMed2Wicri -a AustralieFrV1

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Tue Dec 5 10:43:12 2017. Site generation: Tue Mar 5 14:07:20 2024