Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text
Internal identifier: 004900 (Main/Merge); previous: 004899; next: 004901
Authors: Antonio Jimeno Yepes [United States, Australia]; Élise Prieur-Gaston [France]; Aurélie Névéol [United States, France]
Source:
- BMC Bioinformatics [1471-2105]; 2013.
Abstract
Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain.
We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text.
We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.
Url:
DOI: 10.1186/1471-2105-14-146
PubMed: 23631733
PubMed Central: 3651320
Links to previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: 001571
- to stream Pmc, to step Curation: 001431
- to stream Pmc, to step Checkpoint: 001C03
- to stream Ncbi, to step Merge: 001243
- to stream Ncbi, to step Curation: 001243
- to stream Ncbi, to step Checkpoint: 001243
Links to Exploration step
PMC:3651320
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text</title>
<author><name sortKey="Jimeno Yepes, Antonio" sort="Jimeno Yepes, Antonio" uniqKey="Jimeno Yepes A" first="Antonio" last="Jimeno Yepes">Antonio Jimeno Yepes</name>
<affiliation wicri:level="1"><nlm:aff id="I1">Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><nlm:aff id="I2">NICTA Victoria Research Lab, Melbourne, VIC, 3010, Australia</nlm:aff>
<country xml:lang="fr">Australie</country>
<wicri:regionArea>NICTA Victoria Research Lab, Melbourne, VIC, 3010</wicri:regionArea>
<wicri:noRegion>3010</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Prieur Gaston, Elise" sort="Prieur Gaston, Elise" uniqKey="Prieur Gaston E" first="Élise" last="Prieur-Gaston">Élise Prieur-Gaston</name>
<affiliation wicri:level="4"><nlm:aff id="I3">Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821</wicri:regionArea>
<wicri:noRegion>76821</wicri:noRegion>
<orgName type="university">Université de Rouen</orgName>
<placeName><settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Neveol, Aurelie" sort="Neveol, Aurelie" uniqKey="Neveol A" first="Aurélie" last="Névéol">Aurélie Névéol</name>
<affiliation wicri:level="1"><nlm:aff id="I4">National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><nlm:aff id="I5">LIMSI-CNRS, rue John von Neumann, Orsay, F-91400, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, rue John von Neumann, Orsay, F-91400</wicri:regionArea>
<wicri:noRegion>91400</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">23631733</idno>
<idno type="pmc">3651320</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3651320</idno>
<idno type="RBID">PMC:3651320</idno>
<idno type="doi">10.1186/1471-2105-14-146</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">001571</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">001571</idno>
<idno type="wicri:Area/Pmc/Curation">001431</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">001431</idno>
<idno type="wicri:Area/Pmc/Checkpoint">001C03</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">001C03</idno>
<idno type="wicri:Area/Ncbi/Merge">001243</idno>
<idno type="wicri:Area/Ncbi/Curation">001243</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">001243</idno>
<idno type="wicri:Area/Main/Merge">004900</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text</title>
<author><name sortKey="Jimeno Yepes, Antonio" sort="Jimeno Yepes, Antonio" uniqKey="Jimeno Yepes A" first="Antonio" last="Jimeno Yepes">Antonio Jimeno Yepes</name>
<affiliation wicri:level="1"><nlm:aff id="I1">Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><nlm:aff id="I2">NICTA Victoria Research Lab, Melbourne, VIC, 3010, Australia</nlm:aff>
<country xml:lang="fr">Australie</country>
<wicri:regionArea>NICTA Victoria Research Lab, Melbourne, VIC, 3010</wicri:regionArea>
<wicri:noRegion>3010</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Prieur Gaston, Elise" sort="Prieur Gaston, Elise" uniqKey="Prieur Gaston E" first="Élise" last="Prieur-Gaston">Élise Prieur-Gaston</name>
<affiliation wicri:level="4"><nlm:aff id="I3">Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>Université de Rouen, LITIS EA-4108, 1 rue Thomas Becket, Mont Saint-Aignan, F-76821</wicri:regionArea>
<wicri:noRegion>76821</wicri:noRegion>
<orgName type="university">Université de Rouen</orgName>
<placeName><settlement type="city">Rouen</settlement>
<region type="region" nuts="2">Région Normandie</region>
<region type="old region" nuts="2">Haute-Normandie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Neveol, Aurelie" sort="Neveol, Aurelie" uniqKey="Neveol A" first="Aurélie" last="Névéol">Aurélie Névéol</name>
<affiliation wicri:level="1"><nlm:aff id="I4">National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, USA</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>National Center for Biotechnology Information, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda</wicri:regionArea>
<wicri:noRegion>Bethesda</wicri:noRegion>
</affiliation>
<affiliation wicri:level="1"><nlm:aff id="I5">LIMSI-CNRS, rue John von Neumann, Orsay, F-91400, France</nlm:aff>
<country xml:lang="fr">France</country>
<wicri:regionArea>LIMSI-CNRS, rue John von Neumann, Orsay, F-91400</wicri:regionArea>
<wicri:noRegion>91400</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j">BMC Bioinformatics</title>
<idno type="eISSN">1471-2105</idno>
<imprint><date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Most of the institutional and research information in the biomedical domain is available in the form of English text. Even in countries where English is an official language, such as the United States, language can be a barrier for accessing biomedical information for non-native speakers. Recent progress in machine translation suggests that this technique could help make English texts accessible to speakers of other languages. However, the lack of adequate specialized corpora needed to train statistical models currently limits the quality of automatic translations in the biomedical domain.</p>
</sec>
<sec><title>Results</title>
<p>We show how a large-sized parallel corpus can automatically be obtained for the biomedical domain, using the MEDLINE database. The corpus generated in this work comprises article titles obtained from MEDLINE and abstract text automatically retrieved from journal websites, which substantially extends the corpora used in previous work. After assessing the quality of the corpus for two language pairs (English/French and English/Spanish) we use the Moses package to train a statistical machine translation model that outperforms previous models for automatic translation of biomedical text.</p>
</sec>
<sec><title>Conclusions</title>
<p>We have built translation data sets in the biomedical domain that can easily be extended to other languages available in MEDLINE. These sets can successfully be applied to train statistical machine translation models. While further progress should be made by incorporating out-of-domain corpora and domain-specific lexicons, we believe that this work improves the automatic translation of biomedical texts.</p>
</sec>
</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Asie/explor/AustralieFrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 004900 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 004900 | SxmlIndent | more
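Outside Dilib, the same record can be inspected with a general-purpose XML library. The following is a minimal Python sketch, not part of the Dilib toolchain. It assumes the `<record>` element has been saved as a string. Because the record uses the `wicri:` and `nlm:` prefixes without declaring them, the sketch binds them to placeholder namespace URIs (hypothetical values chosen only so that a strict parser accepts the input). The helper names `parse_record` and `title_and_authors` are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs: the record does not declare the wicri: and
# nlm: prefixes, so placeholder bindings are supplied for the parser.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "nlm": "urn:example:nlm",
}


def parse_record(record_xml: str) -> ET.Element:
    """Parse a <record> fragment (without an XML declaration)."""
    declarations = " ".join(
        f'xmlns:{prefix}="{uri}"' for prefix, uri in PLACEHOLDER_NS.items()
    )
    # Wrap the fragment in an element that declares the missing prefixes.
    wrapped = f"<wrapper {declarations}>{record_xml}</wrapper>"
    return ET.fromstring(wrapped).find("record")


def title_and_authors(record: ET.Element) -> tuple[str, list[str]]:
    """Return the title and the author names listed in titleStmt."""
    title_stmt = record.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = title_stmt.findtext("title", default="")
    authors = [name.text for name in title_stmt.findall("./author/name")]
    return title, authors


# Abbreviated copy of the record above, kept short for illustration.
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en">Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text</title>
<author><name sortKey="Jimeno Yepes, Antonio">Antonio Jimeno Yepes</name>
<affiliation wicri:level="1"><nlm:aff id="I2">NICTA Victoria Research Lab, Melbourne, VIC, 3010, Australia</nlm:aff></affiliation>
</author>
<author><name sortKey="Prieur Gaston, Elise">Élise Prieur-Gaston</name></author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""

if __name__ == "__main__":
    title, authors = title_and_authors(parse_record(SAMPLE))
    print(title)
    print("; ".join(authors))
```

Wrapping the fragment avoids editing the record itself; if the real namespace URIs are known, they can replace the placeholders without changing the rest of the sketch.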
To add a link to this page in the Wicri network
{{Explor lien |wiki= Wicri/Asie |area= AustralieFrV1 |flux= Main |étape= Merge |type= RBID |clé= PMC:3651320 |texte= Combining MEDLINE and publisher data to create parallel corpora for the automatic translation of biomedical text }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Merge/RBID.i -Sk "pubmed:23631733" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Main/Merge/biblio.hfd \
       | NlmPubMed2Wicri -a AustralieFrV1
This area was generated with Dilib version V0.6.33.