Domain adaptation of statistical machine translation with domain-focused web crawling
Internal identifier: 000086 (Main/Exploration); previous: 000085; next: 000087
Authors: Pavel Pecina [Czech Republic]; Antonio Toral [Ireland]; Vassilis Papavassiliou [Greece]; Prokopis Prokopidis [Greece]; Aleš Tamchyna [Czech Republic]; Andy Way [Ireland]; Josef Van Genabith [Germany]
Source:
- Language Resources and Evaluation [1574-020X]; 2014.
Abstract
In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.
Url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479164
DOI: 10.1007/s10579-014-9282-3
PubMed: 26120290
PubMed Central: 4479164
Affiliations:
- Germany, Greece, Ireland, Czech Republic
- Attica (region), Central Bohemia, Saarland (state)
- Athens, Prague, Saarbrücken
Links to previous steps (curation, corpus...)
- to stream Pmc, to step Corpus: 000046
- to stream Pmc, to step Curation: 000045
- to stream Pmc, to step Checkpoint: 000068
- to stream Ncbi, to step Merge: 000124
- to stream Ncbi, to step Curation: 000124
- to stream Ncbi, to step Checkpoint: 000124
- to stream Main, to step Merge: 000086
- to stream Main, to step Curation: 000086
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">Domain adaptation of statistical machine translation with domain-focused web crawling</title>
<author><name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
<affiliation wicri:level="3"><nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName><settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
<affiliation wicri:level="1"><nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
<affiliation wicri:level="3"><nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName><settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
<affiliation wicri:level="3"><nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName><settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
<affiliation wicri:level="3"><nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName><settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
<affiliation wicri:level="1"><nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
<affiliation wicri:level="3"><nlm:aff id="Aff4">Universität des Saarlandes, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Universität des Saarlandes, 66123 Saarbrücken</wicri:regionArea>
<placeName><region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><nlm:aff id="Aff5">DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken</wicri:regionArea>
<placeName><region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">26120290</idno>
<idno type="pmc">4479164</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479164</idno>
<idno type="RBID">PMC:4479164</idno>
<idno type="doi">10.1007/s10579-014-9282-3</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000046</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000046</idno>
<idno type="wicri:Area/Pmc/Curation">000045</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Curation">000045</idno>
<idno type="wicri:Area/Pmc/Checkpoint">000068</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Checkpoint">000068</idno>
<idno type="wicri:Area/Ncbi/Merge">000124</idno>
<idno type="wicri:Area/Ncbi/Curation">000124</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000124</idno>
<idno type="wicri:doubleKey">1574-020X:2014:Pecina P:domain:adaptation:of</idno>
<idno type="wicri:Area/Main/Merge">000086</idno>
<idno type="wicri:Area/Main/Curation">000086</idno>
<idno type="wicri:Area/Main/Exploration">000086</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">Domain adaptation of statistical machine translation with domain-focused web crawling</title>
<author><name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
<affiliation wicri:level="3"><nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName><settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
<affiliation wicri:level="1"><nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
<affiliation wicri:level="3"><nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName><settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
<affiliation wicri:level="3"><nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
<country xml:lang="fr">Grèce</country>
<wicri:regionArea>Institute for Language and Speech Processing/Athena RIC, Athens</wicri:regionArea>
<placeName><settlement type="city">Athènes</settlement>
<region nuts="2" type="region">Attique (région)</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
<affiliation wicri:level="3"><nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
<country xml:lang="fr">République tchèque</country>
<wicri:regionArea>Charles University in Prague, Prague</wicri:regionArea>
<placeName><settlement type="city">Prague</settlement>
<region type="région" nuts="2">Bohême centrale</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
<affiliation wicri:level="1"><nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
<country xml:lang="fr">Irlande (pays)</country>
<wicri:regionArea>Dublin City University, Dublin</wicri:regionArea>
<wicri:noRegion>Dublin</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
<affiliation wicri:level="3"><nlm:aff id="Aff4">Universität des Saarlandes, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>Universität des Saarlandes, 66123 Saarbrücken</wicri:regionArea>
<placeName><region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><nlm:aff id="Aff5">DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany</nlm:aff>
<country xml:lang="fr">Allemagne</country>
<wicri:regionArea>DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken</wicri:regionArea>
<placeName><region type="land" nuts="2">Sarre (Land)</region>
<settlement type="city">Sarrebruck</settlement>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j">Language Resources and Evaluation</title>
<idno type="ISSN">1574-020X</idno>
<idno type="eISSN">1574-0218</idno>
<imprint><date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><p>In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU achieved is substantial at 15.30 points absolute.</p>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct><analytic><author><name sortKey="Ardo, A" uniqKey="Ardo A">A Ardö</name>
</author>
<author><name sortKey="Golub, K" uniqKey="Golub K">K Golub</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Baroni, M" uniqKey="Baroni M">M Baroni</name>
</author>
<author><name sortKey="Bernardini, S" uniqKey="Bernardini S">S Bernardini</name>
</author>
<author><name sortKey="Ferraresi, A" uniqKey="Ferraresi A">A Ferraresi</name>
</author>
<author><name sortKey="Zanchetta, E" uniqKey="Zanchetta E">E Zanchetta</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Bertoldi, N" uniqKey="Bertoldi N">N Bertoldi</name>
</author>
<author><name sortKey="Haddow, B" uniqKey="Haddow B">B Haddow</name>
</author>
<author><name sortKey="Fouet, Jb" uniqKey="Fouet J">JB Fouet</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Brin, S" uniqKey="Brin S">S Brin</name>
</author>
<author><name sortKey="Page, L" uniqKey="Page L">L Page</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Cho, J" uniqKey="Cho J">J Cho</name>
</author>
<author><name sortKey="Garcia Molina, H" uniqKey="Garcia Molina H">H Garcia-Molina</name>
</author>
<author><name sortKey="Page, L" uniqKey="Page L">L Page</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Espla Gomis, M" uniqKey="Espla Gomis M">M Esplà-Gomis</name>
</author>
<author><name sortKey="Forcada, Ml" uniqKey="Forcada M">ML Forcada</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Gao, Z" uniqKey="Gao Z">Z Gao</name>
</author>
<author><name sortKey="Du, Y" uniqKey="Du Y">Y Du</name>
</author>
<author><name sortKey="Yi, L" uniqKey="Yi L">L Yi</name>
</author>
<author><name sortKey="Yang, Y" uniqKey="Yang Y">Y Yang</name>
</author>
<author><name sortKey="Peng, Q" uniqKey="Peng Q">Q Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Kilgarriff, A" uniqKey="Kilgarriff A">A Kilgarriff</name>
</author>
<author><name sortKey="Grefenstette, G" uniqKey="Grefenstette G">G Grefenstette</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Menczer, F" uniqKey="Menczer F">F Menczer</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Menczer, F" uniqKey="Menczer F">F Menczer</name>
</author>
<author><name sortKey="Belew, Rk" uniqKey="Belew R">RK Belew</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Munteanu, Ds" uniqKey="Munteanu D">DS Munteanu</name>
</author>
<author><name sortKey="Marcu, D" uniqKey="Marcu D">D Marcu</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Resnik, P" uniqKey="Resnik P">P Resnik</name>
</author>
<author><name sortKey="Smith, Na" uniqKey="Smith N">NA Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Srinivasan, P" uniqKey="Srinivasan P">P Srinivasan</name>
</author>
<author><name sortKey="Menczer, F" uniqKey="Menczer F">F Menczer</name>
</author>
<author><name sortKey="Pant, G" uniqKey="Pant G">G Pant</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Yu, H" uniqKey="Yu H">H Yu</name>
</author>
<author><name sortKey="Han, J" uniqKey="Han J">J Han</name>
</author>
<author><name sortKey="Chang, Kcc" uniqKey="Chang K">KCC Chang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<affiliations><list><country><li>Allemagne</li>
<li>Grèce</li>
<li>Irlande (pays)</li>
<li>République tchèque</li>
</country>
<region><li>Attique (région)</li>
<li>Bohême centrale</li>
<li>Sarre (Land)</li>
</region>
<settlement><li>Athènes</li>
<li>Prague</li>
<li>Sarrebruck</li>
</settlement>
</list>
<tree><country name="République tchèque"><region name="Bohême centrale"><name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
</region>
<name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
</country>
<country name="Irlande (pays)"><noRegion><name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
</noRegion>
<name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
</country>
<country name="Grèce"><region name="Attique (région)"><name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
</region>
<name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
</country>
<country name="Allemagne"><region name="Sarre (Land)"><name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
</region>
<name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Sarre/explor/MusicSarreV3/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000086 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000086 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Sarre |area= MusicSarreV3 |flux= Main |étape= Exploration |type= RBID |clé= PMC:4479164 |texte= Domain adaptation of statistical machine translation with domain-focused web crawling }}
To generate wiki pages
HfdIndexSelect -h $EXPLOR_AREA/Data/Main/Exploration/RBID.i -Sk "pubmed:26120290" \
| HfdSelect -Kh $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd \
| NlmPubMed2Wicri -a MusicSarreV3
This area was generated with Dilib version V0.6.33.