Exploration server on music in Saarland

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information is therefore not validated.

Internal identifier: 0000460 (Pmc/Corpus); previous: 0000459; next: 0000461



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Domain adaptation of statistical machine translation with domain-focused web crawling</title>
<author>
<name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
<affiliation>
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
<affiliation>
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
<affiliation>
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
<affiliation>
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
<affiliation>
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
<affiliation>
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
<affiliation>
<nlm:aff id="Aff4">Universität des Saarlandes, 66123 Saarbrücken, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff5">DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26120290</idno>
<idno type="pmc">4479164</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4479164</idno>
<idno type="RBID">PMC:4479164</idno>
<idno type="doi">10.1007/s10579-014-9282-3</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000046</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000046</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Domain adaptation of statistical machine translation with domain-focused web crawling</title>
<author>
<name sortKey="Pecina, Pavel" sort="Pecina, Pavel" uniqKey="Pecina P" first="Pavel" last="Pecina">Pavel Pecina</name>
<affiliation>
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Toral, Antonio" sort="Toral, Antonio" uniqKey="Toral A" first="Antonio" last="Toral">Antonio Toral</name>
<affiliation>
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Papavassiliou, Vassilis" sort="Papavassiliou, Vassilis" uniqKey="Papavassiliou V" first="Vassilis" last="Papavassiliou">Vassilis Papavassiliou</name>
<affiliation>
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Prokopidis, Prokopis" sort="Prokopidis, Prokopis" uniqKey="Prokopidis P" first="Prokopis" last="Prokopidis">Prokopis Prokopidis</name>
<affiliation>
<nlm:aff id="Aff3">Institute for Language and Speech Processing/Athena RIC, Athens, Greece</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Tamchyna, Ales" sort="Tamchyna, Ales" uniqKey="Tamchyna A" first="Aleš" last="Tamchyna">Aleš Tamchyna</name>
<affiliation>
<nlm:aff id="Aff1">Charles University in Prague, Prague, Czech Republic</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Way, Andy" sort="Way, Andy" uniqKey="Way A" first="Andy" last="Way">Andy Way</name>
<affiliation>
<nlm:aff id="Aff2">Dublin City University, Dublin, Ireland</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Van Genabith, Josef" sort="Van Genabith, Josef" uniqKey="Van Genabith J" first="Josef" last="Van Genabith">Josef Van Genabith</name>
<affiliation>
<nlm:aff id="Aff4">Universität des Saarlandes, 66123 Saarbrücken, Germany</nlm:aff>
</affiliation>
<affiliation>
<nlm:aff id="Aff5">DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Language Resources and Evaluation</title>
<idno type="ISSN">1574-020X</idno>
<idno type="eISSN">1574-0218</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU is substantial at 15.30 points absolute.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Ardo, A" uniqKey="Ardo A">A Ardö</name>
</author>
<author>
<name sortKey="Golub, K" uniqKey="Golub K">K Golub</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Baroni, M" uniqKey="Baroni M">M Baroni</name>
</author>
<author>
<name sortKey="Bernardini, S" uniqKey="Bernardini S">S Bernardini</name>
</author>
<author>
<name sortKey="Ferraresi, A" uniqKey="Ferraresi A">A Ferraresi</name>
</author>
<author>
<name sortKey="Zanchetta, E" uniqKey="Zanchetta E">E Zanchetta</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bertoldi, N" uniqKey="Bertoldi N">N Bertoldi</name>
</author>
<author>
<name sortKey="Haddow, B" uniqKey="Haddow B">B Haddow</name>
</author>
<author>
<name sortKey="Fouet, Jb" uniqKey="Fouet J">JB Fouet</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Brin, S" uniqKey="Brin S">S Brin</name>
</author>
<author>
<name sortKey="Page, L" uniqKey="Page L">L Page</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Cho, J" uniqKey="Cho J">J Cho</name>
</author>
<author>
<name sortKey="Garcia Molina, H" uniqKey="Garcia Molina H">H Garcia-Molina</name>
</author>
<author>
<name sortKey="Page, L" uniqKey="Page L">L Page</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Espla Gomis, M" uniqKey="Espla Gomis M">M Esplà-Gomis</name>
</author>
<author>
<name sortKey="Forcada, Ml" uniqKey="Forcada M">ML Forcada</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Gao, Z" uniqKey="Gao Z">Z Gao</name>
</author>
<author>
<name sortKey="Du, Y" uniqKey="Du Y">Y Du</name>
</author>
<author>
<name sortKey="Yi, L" uniqKey="Yi L">L Yi</name>
</author>
<author>
<name sortKey="Yang, Y" uniqKey="Yang Y">Y Yang</name>
</author>
<author>
<name sortKey="Peng, Q" uniqKey="Peng Q">Q Peng</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kilgarriff, A" uniqKey="Kilgarriff A">A Kilgarriff</name>
</author>
<author>
<name sortKey="Grefenstette, G" uniqKey="Grefenstette G">G Grefenstette</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Menczer, F" uniqKey="Menczer F">F Menczer</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Menczer, F" uniqKey="Menczer F">F Menczer</name>
</author>
<author>
<name sortKey="Belew, Rk" uniqKey="Belew R">RK Belew</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Munteanu, Ds" uniqKey="Munteanu D">DS Munteanu</name>
</author>
<author>
<name sortKey="Marcu, D" uniqKey="Marcu D">D Marcu</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Resnik, P" uniqKey="Resnik P">P Resnik</name>
</author>
<author>
<name sortKey="Smith, Na" uniqKey="Smith N">NA Smith</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Srinivasan, P" uniqKey="Srinivasan P">P Srinivasan</name>
</author>
<author>
<name sortKey="Menczer, F" uniqKey="Menczer F">F Menczer</name>
</author>
<author>
<name sortKey="Pant, G" uniqKey="Pant G">G Pant</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yu, H" uniqKey="Yu H">H Yu</name>
</author>
<author>
<name sortKey="Han, J" uniqKey="Han J">J Han</name>
</author>
<author>
<name sortKey="Chang, Kcc" uniqKey="Chang K">KCC Chang</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Lang Resour Eval</journal-id>
<journal-id journal-id-type="iso-abbrev">Lang Resour Eval</journal-id>
<journal-title-group>
<journal-title>Language Resources and Evaluation</journal-title>
</journal-title-group>
<issn pub-type="ppub">1574-020X</issn>
<issn pub-type="epub">1574-0218</issn>
<publisher>
<publisher-name>Springer Netherlands</publisher-name>
<publisher-loc>Dordrecht</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26120290</article-id>
<article-id pub-id-type="pmc">4479164</article-id>
<article-id pub-id-type="publisher-id">9282</article-id>
<article-id pub-id-type="doi">10.1007/s10579-014-9282-3</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Paper</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Domain adaptation of statistical machine translation with domain-focused web crawling</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Pecina</surname>
<given-names>Pavel</given-names>
</name>
<address>
<email>pecina@ufal.mff.cuni.cz</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Toral</surname>
<given-names>Antonio</given-names>
</name>
<address>
<email>atoral@computing.dcu.ie</email>
</address>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Papavassiliou</surname>
<given-names>Vassilis</given-names>
</name>
<address>
<email>vpapa@ilsp.gr</email>
</address>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Prokopidis</surname>
<given-names>Prokopis</given-names>
</name>
<address>
<email>prokopis@ilsp.gr</email>
</address>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tamchyna</surname>
<given-names>Aleš</given-names>
</name>
<address>
<email>tamchyna@ufal.mff.cuni.cz</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Way</surname>
<given-names>Andy</given-names>
</name>
<address>
<email>away@computing.dcu.ie</email>
</address>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>van Genabith</surname>
<given-names>Josef</given-names>
</name>
<address>
<email>josef.vangenabith@uni-saarland.de</email>
</address>
<xref ref-type="aff" rid="Aff4"></xref>
<xref ref-type="aff" rid="Aff5"></xref>
</contrib>
<aff id="Aff1">
<label></label>
Charles University in Prague, Prague, Czech Republic</aff>
<aff id="Aff2">
<label></label>
Dublin City University, Dublin, Ireland</aff>
<aff id="Aff3">
<label></label>
Institute for Language and Speech Processing/Athena RIC, Athens, Greece</aff>
<aff id="Aff4">
<label></label>
Universität des Saarlandes, 66123 Saarbrücken, Germany</aff>
<aff id="Aff5">
<label></label>
DFKI, German Research Center for Artificial Intelligence, 66123 Saarbrücken, Germany</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>3</day>
<month>12</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>3</day>
<month>12</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="ppub">
<year>2015</year>
</pub-date>
<volume>49</volume>
<issue>1</issue>
<fpage>147</fpage>
<lpage>193</lpage>
<permissions>
<copyright-statement>© The Author(s) 2015</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<p>In this paper, we tackle the problem of domain adaptation of statistical machine translation (SMT) by exploiting domain-specific data acquired by domain-focused crawling of text from the World Wide Web. We design and empirically evaluate a procedure for automatic acquisition of monolingual and parallel text and their exploitation for system training, tuning, and testing in a phrase-based SMT framework. We present a strategy for using such resources depending on their availability and quantity supported by results of a large-scale evaluation carried out for the domains of environment and labour legislation, two language pairs (English–French and English–Greek) and in both directions: into and from English. In general, machine translation systems trained and tuned on a general domain perform poorly on specific domains and we show that such systems can be adapted successfully by retuning model parameters using small amounts of parallel in-domain data, and may be further improved by using additional monolingual and parallel training data for adaptation of language and translation models. The average observed improvement in BLEU is substantial at 15.30 points absolute.</p>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Statistical machine translation</kwd>
<kwd>Domain adaptation</kwd>
<kwd>Web crawling</kwd>
<kwd>Optimisation</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© Springer Science+Business Media Dordrecht 2015</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1" sec-type="intro">
<title>Introduction</title>
<p>Recent advances in statistical machine translation (SMT) have improved machine translation (MT) quality to such an extent that it can be successfully used in industrial processes (e.g., Flournoy and Duran
<xref ref-type="bibr" rid="CR25">2009</xref>
). However, this mostly happens only in specific domains where ample training data is available (e.g., Wu et al.
<xref ref-type="bibr" rid="CR69">2008</xref>
). Using in-domain data for training has a substantial effect on the final translation quality: the performance of an SMT system usually drops when it is applied to data of a different nature than that on which it was trained (e.g., Banerjee et al.
<xref ref-type="bibr" rid="CR3">2010</xref>
).</p>
<p>SMT is an instance of a machine learning application which in general works best if the data for training and testing are drawn from the same distribution (i.e., domain, genre, and style). In practice, however, it is often difficult to obtain sufficient amounts of in-domain data (in particular, parallel data required for translation and reordering models) to train a system with good performance for a specific domain. The main problem is usually vocabulary coverage: domain-specific texts typically contain a substantial amount of special vocabulary, which is not likely to be found in texts from other domains (Banerjee et al.
<xref ref-type="bibr" rid="CR3">2010</xref>
). Additional problems can be caused by divergence in style or genre, where the difference is not only in lexis but also in other linguistic aspects such as grammar.</p>
<p>In order to achieve optimal performance, an SMT system should be trained on data from the same domain, genre, and style as it is intended to be applied to. For many domains, though, in-domain data of a sufficient size to train an SMT system with good performance is difficult to find. Recent experiments have shown that even small amounts of such data can be used to adapt an existing (general-domain) system to the particular domain of interest (Koehn et al.
<xref ref-type="bibr" rid="CR39">2007</xref>
). Sometimes, appropriate sources of such data come in the form of existing in-house databases and translation memories (He et al.
<xref ref-type="bibr" rid="CR30">2010</xref>
). An alternative option pursued in this paper is to exploit the constantly growing amount of publicly available text on the web, although acquiring data of a sufficient quality and quantity from this resource is a complicated process involving several critical steps (crawling, language identification, cleaning, etc.).</p>
<p>In this research, we first present a strategy and relevant workflows for automatic web-crawling and cleaning of domain-specific data with limited manual intervention. These workflows are based on open-source tools and have also been deployed as web services in the context of the Panacea
<xref ref-type="fn" rid="Fn1">1</xref>
research project (Poch et al.
<xref ref-type="bibr" rid="CR55">2012</xref>
). One advantage of making the tools available as services is that chaining them together enables the building of dynamic and flexible workflows, which can always be improved by integrating new services and/or old legacy systems that may run on different technological platforms. Moreover, the user does not have to deal with technical issues regarding the tools, such as their installation, configuration, or maintenance.</p>
<p>These workflows are then employed to acquire monolingual and parallel data for two domains: environment (
<italic>env</italic>
) and labour legislation (
<italic>lab</italic>
), and two language pairs: English–French (EN–FR) and English–Greek (EN–EL). The crawled data is further exploited for domain adaptation of a general-domain SMT system in several ways: by domain-specific parameter tuning of the main log-linear model and by adaptation of its components. The evaluation experiments carried out in a total of eight evaluation scenarios (two domains, two language pairs, and both translation directions: to and from English) confirm substantial and consistent improvements in translation quality for all approaches compared to the baseline.</p>
<p>We explain the improvements brought about by these methods through a detailed analysis of the experimental results. In a nutshell, tuning for matching-domain training and test data results in weight vectors that trust (often long) translation table entries. Tuning with and for specific domains (while using generic training data) allows the MT system to stitch together translations from smaller fragments, which, in this case, leads to improved translation quality. Such tuning requires only small development sets, which can be harvested automatically from the web with minimal human intervention; no manual cleaning of the development data is necessary.</p>
<p>In addition, further improvements are realised by using monolingual and/or parallel in-domain training data. Adaptation of language models focuses on improving translation fluency and lexical selection for the particular domain. Adaptation of the translation model then aims at reducing the out-of-vocabulary (OOV) rate and adding domain-relevant translation variants. All the data sets are available via the European Language Resources Association (ELRA).</p>
<p>This paper is an extended and updated version of our previous work published as Pecina et al. (
<xref ref-type="bibr" rid="CR51">2011</xref>
,
<xref ref-type="bibr" rid="CR53">2012a</xref>
,
<xref ref-type="bibr" rid="CR52">b</xref>
). Compared to these conference papers, we provide more details of the experiments, full results, and a more thorough analysis and description of our findings. Some experiments are new and not contained in the standalone papers. These include a comparison of various methods for adaptation of language models and translation models (including state-of-the-art linear interpolation), as well as a comparison of the OOV rate (i.e., the ratio of source words unknown to the translation model), language model perplexity measures, and average phrase length in the test set translations (cf. Table
<xref rid="Tab15" ref-type="table">15</xref>
). Compared to the previous papers, the translation quality evaluation in this work is conducted on tokenized and lowercased translations to avoid any bias caused by recasing and detokenization. We also provide much longer descriptions of both the related work and our data acquisition procedure. Finally, we formulate this paper as one concise yet coherent account of the full range of experiments carried out.
<p>The remainder of the paper is organised as follows. After the overview of related work and description of the state-of-the-art in Sect. 
<xref rid="Sec2" ref-type="sec">2</xref>
, we present our web-crawling procedure for monolingual and parallel data in Sect. 
<xref rid="Sec7" ref-type="sec">3</xref>
, and the baseline system including its evaluation in Sect. 
<xref rid="Sec12" ref-type="sec">4</xref>
. Section 
<xref rid="Sec16" ref-type="sec">5</xref>
is devoted to system adaptation by parameter tuning and Sect. 
<xref rid="Sec22" ref-type="sec">6</xref>
to adaptation of language and translation models. Section 
<xref rid="Sec26" ref-type="sec">7</xref>
, which concludes the work, is followed by an
<xref rid="App1" ref-type="app">Appendix</xref>
containing formal definitions of the two domains relevant to our work and complete results of the main experiments.</p>
</sec>
<sec id="Sec2">
<title>State-of-the-art and related work</title>
<p>In this section, we review the current state-of-the-art in the area of web crawling for monolingual as well as parallel data and briefly describe the main concepts of phrase-based SMT (PB-SMT) and its adaptation to specific domains.</p>
<sec id="Sec3">
<title>Web crawling for textual data</title>
<p>Web crawling is the automatic process of travelling through the World Wide Web by extracting links of already fetched web pages and adding them to the list of pages to be visited. The selection of the next link to be followed is a key challenge for the evolution of the crawl and is tied to the goal of the process. For example, a crawler that aims to index the web as a whole may not prioritise the links at all, while a focused/topical crawler that aspires to build domain-specific web collections (Qin and Chen
<xref ref-type="bibr" rid="CR57">2005</xref>
) may use a relevance score to decide which pages to visit first or not at all.</p>
<p>Several algorithms have been exploited for selecting the most promising links. The Best-First algorithm (Cho et al.
<xref ref-type="bibr" rid="CR17">1998</xref>
) sorts the links with respect to their relevance scores and selects a predefined amount of them as the seeds for the next crawling cycle. The PageRank (Brin and Page
<xref ref-type="bibr" rid="CR13">1998</xref>
) algorithm exploits the ‘popularity’ of a web page, i.e., the probability that a random crawler will visit that page at any given time, instead of its relevance. Menczer and Belew (
<xref ref-type="bibr" rid="CR44">2000</xref>
) propose an adaptive population of agents that search for pages relevant to a domain, using evolving query vectors and neural nets to decide which links to follow.</p>
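<!--
A minimal illustration (ours, not from the cited works) of the Best-First selection described above: candidate links are kept in a priority queue ordered by their relevance scores, and the most promising link is fetched next. The relevance scoring function itself is assumed to be given.

import heapq

class BestFirstFrontier:
    """Crawl frontier that always yields the highest-scoring unseen URL."""

    def __init__(self):
        self.heap = []      # (negated score, url); heapq is a min-heap
        self.seen = set()

    def add(self, url, relevance):
        if url not in self.seen:
            self.seen.add(url)
            heapq.heappush(self.heap, (-relevance, url))

    def next_url(self):
        return heapq.heappop(self.heap)[1] if self.heap else None

# frontier.add("http://example.org/biodiversity", 0.9) and similar calls
# would be made for every link extracted from a fetched page.
-->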
<p>In other approaches (Dziwiński and Rutkowska
<xref ref-type="bibr" rid="CR21">2008</xref>
; Gao et al.
<xref ref-type="bibr" rid="CR28">2010</xref>
), the selection of the next links is also influenced by the distance between relevant pages (i.e., the number of links the crawler must follow in order to visit a particular page starting from another relevant page). A general framework, which defines crawling tasks of variable difficulty and fairly evaluates focused crawling algorithms under a number of performance metrics (precision and recall, relevance, algorithmic efficiency, etc.) was proposed by Srinivasan et al. (
<xref ref-type="bibr" rid="CR63">2005</xref>
).</p>
<p>Another challenging task in producing good-quality language resources from the web is the removal of parts of the web page such as navigation links, advertisements, disclaimers, etc. (often called boilerplate), since they are of only limited or no value for the purposes of studying language use and change (Kilgarriff and Grefenstette
<xref ref-type="bibr" rid="CR33">2003</xref>
) or for training an MT system. A review of cleaning methods is presented by Spousta et al. (
<xref ref-type="bibr" rid="CR62">2008</xref>
), among others.</p>
<p>Apart from the crawling algorithm, classification of web content as relevant or otherwise affects the acquisition of domain-specific resources, on the assumption that relevant pages are more likely to contain links to more pages in the same domain. Qi and Davison (
<xref ref-type="bibr" rid="CR56">2009</xref>
) review features and algorithms used in web page classification. Most of the reviewed algorithms apply supervised machine-learning methods (support vector machines, decision trees, neural networks, etc.) on feature vectors consisting of on-page features, such as textual content and HTML tags (Yu et al.
<xref ref-type="bibr" rid="CR70">2004</xref>
). Many algorithms exploit additional information contained in web pages, including anchor text of hyperlinks. Some methods adopt the assumption that neighbouring pages are likely to be in the same domain (Menczer
<xref ref-type="bibr" rid="CR43">2005</xref>
).</p>
<p>The WebBootCat toolkit (Baroni et al.
<xref ref-type="bibr" rid="CR7">2006</xref>
) harvests domain-specific data from the web by querying search engines with tuples of in-domain terms. Combine
<xref ref-type="fn" rid="Fn2">2</xref>
is an open-source focused crawler based on a combination of a general web crawler and a topic classifier. Efficient focused web crawlers can be built by adapting existing open-source frameworks such as Heritrix,
<xref ref-type="fn" rid="Fn3">3</xref>
Nutch,
<xref ref-type="fn" rid="Fn4">4</xref>
and Bixo.
<xref ref-type="fn" rid="Fn5">5</xref>
</p>
</sec>
<sec id="Sec4">
<title>Web crawling for parallel texts</title>
<p>Compared to crawling for monolingual data, acquisition of parallel texts from the web is even more challenging. Even though there are many multilingual websites with pairs of pages that are translations of each other, detection of such sites and identification of the pairs is far from straightforward.</p>
<p>Considering the web as a parallel corpus, Resnik and Smith (
<xref ref-type="bibr" rid="CR58">2003</xref>
) present the STRAND system, in which they use a search engine to search for multilingual websites and examine the similarity of the HTML structures of the fetched web pages in order to identify pairs of potentially parallel pages. Besides structural similarity, systems such as PTMiner (Nie et al.
<xref ref-type="bibr" rid="CR48">1999</xref>
) and WeBiText (Désilets et al.
<xref ref-type="bibr" rid="CR19">2008</xref>
) filtered fetched web pages by keeping only those containing language markers in their URLs. Chen et al. (
<xref ref-type="bibr" rid="CR16">2004</xref>
) proposed the Parallel Text Identification System, which incorporated a content analysis module using a predefined bilingual wordlist. Similarly, Zhang et al. (
<xref ref-type="bibr" rid="CR71">2006</xref>
) adopted a naive aligner in order to estimate the content similarity of candidate parallel web pages. Esplà-Gomis and Forcada (
<xref ref-type="bibr" rid="CR23">2010</xref>
) developed Bitextor, a system combining language identification with shallow features (file size, text length, tag structure, and list of numbers in a web page) to mine parallel pages from multilingual sites that have already been stored locally with the HTTrack
<xref ref-type="fn" rid="Fn6">6</xref>
website copier. Barbosa et al. (
<xref ref-type="bibr" rid="CR6">2012</xref>
) crawl the web and examine the HTML DOM tree of visited web pages with the purpose of detecting multilingual websites based on the collation of links that are very likely to point to in-site pages in different languages. Once a multilingual site is detected, they use an intra-site crawler and alignment procedures to harvest parallel text for multiple pairs of languages.</p>
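<!--
A hedged sketch (ours, not part of any system cited above) of the URL-based matching used by PTMiner- and WeBiText-style approaches: URLs that coincide after their language markers are masked are treated as candidate parallel pages. The marker lists and URL patterns here are illustrative assumptions.

import re

MARKERS = {"en": ["en", "english"], "fr": ["fr", "french"]}  # assumed lists

def mask(url, lang):
    """Replace a language marker delimited by /, ., _, = or - with '@'."""
    for m in MARKERS[lang]:
        url = re.sub(r"(?<=[/._=-])" + m + r"(?=[/._=-])", "@", url)
    return url

def candidate_pairs(en_urls, fr_urls):
    """Pair pages whose URLs are identical once markers are masked."""
    index = {mask(u, "en"): u for u in en_urls}
    return [(index[k], f) for f in fr_urls
            if (k := mask(f, "fr")) in index]
-->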
</sec>
<sec id="Sec5">
<title>Phrase-based statistical machine translation</title>
<p>In PB-SMT (e.g., Moses (Koehn et al.
<xref ref-type="bibr" rid="CR39">2007</xref>
)), an input sentence is segmented into sequences of consecutive words, called phrases. Each phrase is then translated into a target-language phrase, which may be reordered with other translated phrases to produce the output.</p>
<p>Formally, the model is based on the noisy channel model. The translation
<inline-formula id="IEq1">
<alternatives>
<tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\mathbf e} $$\end{document}</tex-math>
<mml:math id="M2">
<mml:mi mathvariant="bold">e</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
of an input sentence
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M3">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\mathbf f} $$\end{document}</tex-math>
<mml:math id="M4">
<mml:mi mathvariant="bold">f</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
is searched for by maximising the translation probability
<inline-formula id="IEq3">
<alternatives>
<tex-math id="M5">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ p({\mathbf e} |{\mathbf f} )$$\end{document}</tex-math>
<mml:math id="M6">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq3.gif"></inline-graphic>
</alternatives>
</inline-formula>
formulated as a log-linear combination of a set of feature functions
<inline-formula id="IEq4">
<alternatives>
<tex-math id="M7">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_i$$\end{document}</tex-math>
<mml:math id="M8">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq4.gif"></inline-graphic>
</alternatives>
</inline-formula>
and their weights
<inline-formula id="IEq5">
<alternatives>
<tex-math id="M9">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\lambda _i$$\end{document}</tex-math>
<mml:math id="M10">
<mml:msub>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq5.gif"></inline-graphic>
</alternatives>
</inline-formula>
:
<disp-formula id="Equ1">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} p({\mathbf e} |{\mathbf f} ) = \prod _{i=1}^n h_i({\mathbf e} ,{\mathbf f} )^{\lambda _i}. \end{aligned}$$\end{document}</tex-math>
<mml:math id="M12" display="block">
<mml:mrow>
<mml:mtable columnspacing="0.5ex">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">e</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi mathvariant="italic">λ</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:msup>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="10579_2014_9282_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
Typically, the components include features of the following models (the symbols in brackets refer to the actual features used in our experiments described in Sects.
<xref rid="Sec12" ref-type="sec">4</xref>
<xref rid="Sec22" ref-type="sec">6</xref>
):
<italic>reordering</italic>
(
<italic>distortion</italic>
)
<italic>model</italic>
(
<inline-formula id="IEq6">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_1$$\end{document}</tex-math>
<mml:math id="M14">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq6.gif"></inline-graphic>
</alternatives>
</inline-formula>
<inline-formula id="IEq7">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_7$$\end{document}</tex-math>
<mml:math id="M16">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>7</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq7.gif"></inline-graphic>
</alternatives>
</inline-formula>
), which allows the reordering of phrases in the input sentences (e.g., distance-based and lexicalised reordering),
<italic>language model</italic>
(
<inline-formula id="IEq8">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_8$$\end{document}</tex-math>
<mml:math id="M18">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>8</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq8.gif"></inline-graphic>
</alternatives>
</inline-formula>
), which ensures that the translations are fluent,
<italic>phrase translation model</italic>
(
<inline-formula id="IEq9">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_8$$\end{document}</tex-math>
<mml:math id="M20">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>9</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq9.gif"></inline-graphic>
</alternatives>
</inline-formula>
<inline-formula id="IEq10">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{12}$$\end{document}</tex-math>
<mml:math id="M22">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>12</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq10.gif"></inline-graphic>
</alternatives>
</inline-formula>
), which ensures that the source and target phrases are good translations of each other (e.g., direct and inverse phrase translation probability, and direct and indirect lexical weighting),
<italic>phrase penalty</italic>
(
<inline-formula id="IEq11">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{13}$$\end{document}</tex-math>
<mml:math id="M24">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>13</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq11.gif"></inline-graphic>
</alternatives>
</inline-formula>
), which controls the number of phrases the translation consists of, and
<italic>word penalty</italic>
(
<inline-formula id="IEq12">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{14}$$\end{document}</tex-math>
<mml:math id="M26">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>14</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq12.gif"></inline-graphic>
</alternatives>
</inline-formula>
), which prevents the translations from being too long or too short.</p>
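<!--
A toy numeric illustration (ours) of the log-linear combination defined above: the score of a hypothesis is the weighted sum of the logs of its feature values, and decoding keeps the highest-scoring hypothesis. The feature values and weights below are made up.

import math

def loglinear_score(features, weights):
    """log p(e|f) = sum_i lambda_i * log h_i(e, f)."""
    return sum(lam * math.log(h) for h, lam in zip(features, weights))

weights = [0.3, 0.5, 0.2]   # lambda_1..lambda_3 (toy values)
hyp_a = [0.8, 0.1, 0.9]     # h_1..h_3 for hypothesis A
hyp_b = [0.6, 0.3, 0.7]     # h_1..h_3 for hypothesis B
best = max([hyp_a, hyp_b], key=lambda h: loglinear_score(h, weights))
-->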
<p>The weights of the log-linear combination influence overall translation quality; however, the optimal setting depends on the translation direction and data. A common solution to optimise weights is to use Minimum Error Rate Training (MERT: Och
<xref ref-type="bibr" rid="CR49">2003</xref>
), which automatically searches for the values that minimise a given error measure (or maximise a given translation quality measure) on a development set of parallel sentences. Theoretically, any automatic measure can be used for this purpose; however, the most commonly used is BLEU (Papineni et al.
<xref ref-type="bibr" rid="CR50">2002</xref>
). The search algorithm is a type of coordinate ascent: considering the
<inline-formula id="IEq13">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document}</tex-math>
<mml:math id="M28">
<mml:mi>n</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq13.gif"></inline-graphic>
</alternatives>
</inline-formula>
-best translation hypotheses for each input sentence, it updates the feature weight which is most likely to improve the objective and iterates until convergence. The error surface is highly non-convex. Since the algorithm cannot explore the whole parameter space, it may converge to a local maximum. In practice, however, it often produces good results (Bertoldi et al.
<xref ref-type="bibr" rid="CR11">2009</xref>
).</p>
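<!--
A deliberately simplified sketch (ours) of the coordinate-ascent idea behind MERT: optimise one weight at a time and keep a change only if the dev-set objective (e.g., BLEU computed over n-best lists) improves. Real MERT performs an exact line search along each coordinate using the piecewise-linear structure of the error surface; the greedy grid search below only conveys the per-coordinate iteration.

def coordinate_ascent(weights, objective, deltas=(-0.1, 0.1), max_iter=20):
    """Greedily perturb each weight; stop when a full pass brings no gain."""
    best = objective(weights)
    for _ in range(max_iter):
        improved = False
        for i in range(len(weights)):
            for d in deltas:
                trial = list(weights)
                trial[i] += d
                score = objective(trial)   # e.g., BLEU on the dev set
                if score > best:
                    weights, best, improved = trial, score, True
        if not improved:
            break
    return weights, best
-->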
</sec>
<sec id="Sec6">
<title>Domain adaptation in statistical machine translation</title>
<p>Domain adaptation is a very active research topic within the area of SMT. Three main topics can be identified depending on the availability of domain-specific data: (1) if any in-domain data is available, it can be directly used to improve the MT system, e.g., by combining the (limited) in-domain with (more extensive) out-of-domain resources for training; (2) if in-domain data exists but is not readily available, one may attempt to acquire domain-specific data (e.g., from the web, which is the case of our work); (3) finally, if sources of in-domain data cannot be identified, one may attempt to select pseudo in-domain data (Axelrod et al.
<xref ref-type="bibr" rid="CR2">2011</xref>
) from general-domain sources. Below, we review a selection of relevant work that falls into these topics.</p>
<p>The first attempt to perform domain adaptation was carried out by Langlais (
<xref ref-type="bibr" rid="CR41">2002</xref>
), who integrated in-domain lexicons in the translation model. Wu and Wang (
<xref ref-type="bibr" rid="CR68">2004</xref>
) used in-domain data to improve word alignment in the training phase. Much work on domain adaptation in the interim has looked at mixture modelling, whereby separate models are built for each data set (e.g., in-domain and out-of-domain) which are then interpolated. There have been attempts to combine both language models (Koehn and Schroeder
<xref ref-type="bibr" rid="CR38">2007</xref>
) and translation models (Nakov
<xref ref-type="bibr" rid="CR47">2008</xref>
; Sanchis-Trilles and Casacuberta
<xref ref-type="bibr" rid="CR59">2010</xref>
; Bisazza et al.
<xref ref-type="bibr" rid="CR12">2011</xref>
). The features of the different models can be combined by linear or log-linear interpolation (Foster and Kuhn
<xref ref-type="bibr" rid="CR26">2007</xref>
; Banerjee et al.
<xref ref-type="bibr" rid="CR4">2011</xref>
). Ways to optimize interpolation weights include the minimization of the model perplexity on a development set (Sennrich
<xref ref-type="bibr" rid="CR60">2012</xref>
) and the maximization of an evaluation metric (Haddow
<xref ref-type="bibr" rid="CR29">2013</xref>
). Mixture model techniques have been applied to a number of scenarios, including the combination of different kinds of data (e.g., questions and declarative sentences, Finch and Sumita
<xref ref-type="bibr" rid="CR24">2008</xref>
) and the combination of different types of translation models (e.g., surface form and factored, Koehn and Haddow
<xref ref-type="bibr" rid="CR37">2012</xref>
).</p>
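<!--
A small sketch (ours) of the linear interpolation mentioned above, for two language models: the mixture probability is lam * p_in + (1 - lam) * p_out, and lam can be tuned by minimising perplexity on a development set, as in Sennrich (2012). The callables lm_in and lm_out are assumed to return n-gram probabilities.

import math

def interpolate(p_in, p_out, lam):
    """Linear mixture of in-domain and out-of-domain LM probabilities."""
    return lam * p_in + (1.0 - lam) * p_out

def dev_perplexity(dev_ngrams, lm_in, lm_out, lam):
    """Perplexity of the mixture on a dev set; choose lam minimising it."""
    logprob = sum(math.log2(interpolate(lm_in(g), lm_out(g), lam))
                  for g in dev_ngrams)
    return 2.0 ** (-logprob / len(dev_ngrams))

# lam can then be chosen from a simple grid such as 0.1, 0.2, ..., 0.9.
-->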
<p>A second strand towards domain adaptation regards the acquisition of in-domain data. Munteanu and Marcu (
<xref ref-type="bibr" rid="CR46">2005</xref>
) extract in-domain sentence pairs from comparable corpora. Daumé III and Jagarlamudi (
<xref ref-type="bibr" rid="CR18">2011</xref>
) attempt to reduce OOV terms when targeting a specific domain by mining their translations from comparable corpora. Bertoldi and Federico (
<xref ref-type="bibr" rid="CR10">2009</xref>
) rely on large amounts of in-domain monolingual data to create synthetic parallel corpora for training. Pecina et al. (
<xref ref-type="bibr" rid="CR51">2011</xref>
) exploit automatically web-crawled in-domain resources for parameter optimisation and to improve language models. Pecina et al. (
<xref ref-type="bibr" rid="CR53">2012a</xref>
) extend this work by using the web-crawled resources to also improve translation models.</p>
<p>The selection of pseudo in-domain data is another approach to domain adaptation, based on the assumption that a sufficiently broad general-domain corpus will include sentences that resemble the target domain. Eck et al. (
<xref ref-type="bibr" rid="CR22">2004</xref>
) present a technique for adapting the language model by selecting similar sentences from available training data. Hildebrand et al. (
<xref ref-type="bibr" rid="CR31">2005</xref>
) extended this approach to the translation model. Foster et al. (
<xref ref-type="bibr" rid="CR27">2010</xref>
) weight phrase pairs from out-of-domain corpora according to their relevance to the target domain. Moore and Lewis (
<xref ref-type="bibr" rid="CR45">2010</xref>
) used the difference of cross-entropies under an in-domain model and a general-domain model to filter monolingual data for language modelling. Axelrod et al. (
<xref ref-type="bibr" rid="CR2">2011</xref>
) used a similar approach to filter parallel training data. Recent works extend the cross-entropy approach by combining this score with scores based on quality estimation (Banerjee et al.
<xref ref-type="bibr" rid="CR5">2013</xref>
) and translation models (Mansour et al.
<xref ref-type="bibr" rid="CR42">2011</xref>
) and by using linguistic units instead of surface forms to perform the selection (Toral
<xref ref-type="bibr" rid="CR66">2013</xref>
).</p>
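<!--
A brief sketch (ours) of the Moore-Lewis criterion described above: each sentence is scored by the difference of its per-word cross-entropies under an in-domain and a general-domain language model, and only the lowest-scoring (most in-domain-like) sentences are kept. H_in and H_out are assumed helpers returning per-word cross-entropy.

def moore_lewis_score(sentence, H_in, H_out):
    """Cross-entropy difference H_in(s) - H_out(s); lower is more in-domain."""
    return H_in(sentence) - H_out(sentence)

def select_pseudo_in_domain(corpus, H_in, H_out, threshold=0.0):
    """Keep sentences scoring below the (tunable) threshold."""
    return [s for s in corpus
            if moore_lewis_score(s, H_in, H_out) < threshold]
-->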
<p>In a recent workshop held to better understand and address issues that arise in domain adaptation for MT (Carpuat et al.
<xref ref-type="bibr" rid="CR15">2012</xref>
), the use of phrase-sense disambiguation (Carpuat and Wu
<xref ref-type="bibr" rid="CR14">2007</xref>
) to model content in SMT was investigated, with the conclusion that it can successfully model lexical choice across domains. In addition, a method for translation mining based on document-pair marginal matching was developed, with the aim of acquiring useful translations for OOVs from comparable and parallel data.</p>
</sec>
</sec>
<sec id="Sec7">
<title>Domain-focused web crawling for monolingual and parallel data</title>
<p>Domain-focused web crawling aims to visit (and store) web pages relevant to a specific domain only. A critical issue is the construction of the domain definition (see
<xref rid="App1" ref-type="app">Appendix</xref>
), since each web page visited by the crawler should be classified as relevant or non-relevant to the domain with respect to this definition. As we did not possess training data for the domains and languages targeted in our experiments, we followed the approach of Ardö and Golub (
<xref ref-type="bibr" rid="CR1">2007</xref>
) and represented each domain as a list of weighted terms. Formally, the domain definition consists of triplets 〈
<italic>relevance weight, term, domain or subdomain(s)</italic>
〉. If the terms are publicly available online, as is often the case, this approach does not require any domain expertise.</p>
<p>For our experiments, we selected English, French, and Greek terms (both single- and multi-word entries) from the “Environment” (536, 277, and 513 terms respectively) and “Employment and Working Conditions” (134, 96, and 157 terms respectively) domains of the EuroVoc
<xref ref-type="fn" rid="Fn7">7</xref>
thesaurus v4.3. The EuroVoc structure also allowed us to automatically assign each term to one or more of the following subdomains: natural environment, deterioration of the environment, environmental policy, energy policy and cultivation of agricultural land for
<italic>env</italic>
; labour law and labour relations, organisation of work and working conditions, personnel management and staff remuneration, employment and labour market for
<italic>lab</italic>
. Information about subdomains can prove useful in acquiring more focused collections.
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>An extract of an example English definition manually constructed for the environment domain</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Weight</th>
<th align="left">Term</th>
<th align="left">Subdomain(s)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">80</td>
<td align="left">Desertification</td>
<td align="left">Deterioration of the environment; natural environment</td>
</tr>
<tr>
<td align="left">80</td>
<td align="left">Available energy resources</td>
<td align="left">Energy policy; natural environment</td>
</tr>
<tr>
<td align="left">100</td>
<td align="left">Biodiversity</td>
<td align="left">Natural environment</td>
</tr>
<tr>
<td align="left">50</td>
<td align="left">Clean industry</td>
<td align="left">Environmental policy</td>
</tr>
<tr>
<td align="left">70</td>
<td align="left">Deforestation</td>
<td align="left">Cultivation of agricultural land; deterioration of the environment</td>
</tr>
<tr>
<td align="left">−100</td>
<td align="left">Music</td>
<td align="left"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Each entry was manually assigned a weight indicating the term’s domain relevance, with higher values denoting more relevant terms. Even though a domain expert is required to differentiate relevant terms and assign various weights to them, initial experiments showed that a domain-specific corpus can be constructed (see Sect.
<xref rid="Sec8" ref-type="sec">3.1</xref>
) by using a single positive weight of 100 for all terms. In the case of ambiguous terms (e.g., “heavy metal” as a music genre and as an element dangerous for the environment), a user could either exclude the term from the domain definition or assign a negative weight to a term closely related to the ambiguous term’s unwanted reading (i.e., include the term “music” and assign it a negative weight) in order to penalise occurrences of that term. For illustration, a sample from the definition for the
<italic>env</italic>
domain is given in Table 
<xref rid="Tab1" ref-type="table">1</xref>
.</p>
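<!--
A hedged sketch (ours) of how a fetched page can be scored against a weighted-term definition like the one in Table 1: the weights of matched terms are summed, so negative entries such as "music" penalise unwanted readings. The acceptance threshold below is a made-up value.

def page_relevance(text, definition):
    """Sum the weights of all definition terms occurring in the text."""
    text = text.lower()
    return sum(w * text.count(term) for term, w in definition.items())

env = {"biodiversity": 100, "desertification": 80,
       "deforestation": 70, "clean industry": 50, "music": -100}
page = "Deforestation and loss of biodiversity threaten ecosystems."
keep = page_relevance(page, env) >= 100   # hypothetical threshold
-->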
<sec id="Sec8">
<title>Acquisition of monolingual texts</title>
<p>In order to acquire in-domain corpora from the web, we implemented an open-source focused crawler (Papavassiliou et al.
<xref ref-type="bibr" rid="CR72">2013</xref>
). The crawler adopts a distributed computing architecture based on Bixo, an open-source web-mining toolkit running on top of Hadoop
<xref ref-type="fn" rid="Fn8">8</xref>
and making use of ideas from the Nutch and Heritrix web crawlers. In addition, the crawler integrates procedures for normalisation, language identification, boilerplate removal, text classification and URL ranking. Users can configure several settings related to focused crawling (e.g., the number of concurrent harvesters, filtering out specific document types, the required number of terms, etc.). For the acquisition of monolingual corpora, we used the focused crawler’s monolingual mode of operation (FMC), which is also available as a web service.
<xref ref-type="fn" rid="Fn9">9</xref>
</p>
<p>To initialise the crawler for the
<italic>env</italic>
domain, we constructed lists of seed URLs selected from relevant lists in the Open Directory Project.
<xref ref-type="fn" rid="Fn10">10</xref>
Alternative resources include the Yahoo
<xref ref-type="fn" rid="Fn11">11</xref>
directory. For the
<italic>lab</italic>
domain, similar lists were not so easy to find. The seed lists were therefore generated from queries for random combinations of terms using the WebBootCat toolkit (Baroni et al.
<xref ref-type="bibr" rid="CR7">2006</xref>
). When a page is fetched by the crawler, it is parsed in order to extract its metadata and content and normalised to the UTF-8 encoding. Next, the language is identified using the
<italic>n</italic>
-gram-based method included in the Apache Tika toolkit.
<xref ref-type="fn" rid="Fn12">12</xref>
In order to detect parts of text not in the targeted language, the language identifier is also applied at paragraph level and these parts are marked as such. The next processing step concerns boilerplate detection. For this task, we used a modified version of Boilerpipe (Kohlschütter et al.
<xref ref-type="bibr" rid="CR40">2010</xref>
), which also extracts structural information (such as title, heading, and list item) and segments text into paragraphs by exploiting HTML tags. Paragraphs judged to be boilerplate are filtered out, and each normalised page is then compared to the domain definition.
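<!--
A rough end-to-end sketch (ours) of the per-page processing just described. The helper callables stand in for the Tika-based language identifier, the modified Boilerpipe component and the HTML-tag-based paragraph segmentation; page_relevance is the scoring sketch given earlier.

def process_page(html, target_lang, definition,
                 split_paragraphs, language_of, is_boilerplate,
                 page_relevance):
    """Return the cleaned in-language text and its domain-relevance score."""
    kept = []
    for par in split_paragraphs(html):       # segmentation via HTML tags
        if is_boilerplate(par):              # drop navigation, ads, etc.
            continue
        if language_of(par) != target_lang:  # mark/skip foreign paragraphs
            continue
        kept.append(par)
    text = "\n".join(kept)
    return text, page_relevance(text, definition)
-->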
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Statistics from the trial and main phase of crawling of monolingual data:
<italic>pages stored</italic>
refers to the subset of
<italic>pages visited</italic>
and classified as in-domain,
<italic>pages deduped</italic>
refers to the pages after near-duplicate removal,
<italic>time</italic>
is total duration (in hours), and
<italic>acc</italic>
is accuracy estimated on the
<italic>pages sampled</italic>
that were crawled and classified during the trial phase</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Language</th>
<th align="left" rowspan="2">Domain</th>
<th align="left" colspan="4">Trial phase</th>
<th align="left" colspan="7">Main phase</th>
</tr>
<tr>
<th align="left">Sites</th>
<th align="left">Pages all</th>
<th align="left">Pages sampled</th>
<th align="left">Acc (%)</th>
<th align="left">Sites</th>
<th align="left">Pages visited</th>
<th align="left">Pages stored</th>
<th align="left">
<inline-formula id="IEq17">
<alternatives>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\varDelta \%)$$\end{document}</tex-math>
<mml:math id="M30">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mo>%</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq17.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">Pages deduped</th>
<th align="left">
<inline-formula id="IEq18">
<alternatives>
<tex-math id="M31">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(\varDelta \%)$$\end{document}</tex-math>
<mml:math id="M32">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mo>%</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq18.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">time (h)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="2">English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">146</td>
<td char="." align="char">505</td>
<td char="." align="char">224</td>
<td char="." align="char">
<italic>92.9</italic>
</td>
<td char="." align="char">3,181</td>
<td char="." align="char">90,240</td>
<td char="." align="char">34,572</td>
<td char="." align="char">
<italic>38.3</italic>
</td>
<td char="." align="char">28,071</td>
<td char="." align="char">
<italic>18.8</italic>
</td>
<td char="." align="char">47</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">150</td>
<td char="." align="char">461</td>
<td char="." align="char">215</td>
<td char="." align="char">
<italic>91.6</italic>
</td>
<td char="." align="char">1,614</td>
<td char="." align="char">121,895</td>
<td char="." align="char">22,281</td>
<td char="." align="char">
<italic>18.3</italic>
</td>
<td char="." align="char">15,197</td>
<td char="." align="char">
<italic>31.8</italic>
</td>
<td char="." align="char">50</td>
</tr>
<tr>
<td align="left" rowspan="2">French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">106</td>
<td char="." align="char">543</td>
<td char="." align="char">232</td>
<td char="." align="char">
<italic>95.7</italic>
</td>
<td char="." align="char">2,016</td>
<td char="." align="char">160,059</td>
<td char="." align="char">35,488</td>
<td char="." align="char">
<italic>22.2</italic>
</td>
<td char="." align="char">23,514</td>
<td char="." align="char">
<italic>33.7</italic>
</td>
<td char="." align="char">67</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">64</td>
<td char="." align="char">839</td>
<td char="." align="char">268</td>
<td char="." align="char">
<italic>98.1</italic>
</td>
<td char="." align="char">1,404</td>
<td char="." align="char">186,748</td>
<td char="." align="char">45,660</td>
<td char="." align="char">
<italic>27.2</italic>
</td>
<td char="." align="char">26,675</td>
<td char="." align="char">
<italic>41.6</italic>
</td>
<td char="." align="char">72</td>
</tr>
<tr>
<td align="left" rowspan="2">Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">112</td>
<td char="." align="char">524</td>
<td char="." align="char">227</td>
<td char="." align="char">
<italic>97.4</italic>
</td>
<td char="." align="char">1,104</td>
<td char="." align="char">113,737</td>
<td char="." align="char">31,524</td>
<td char="." align="char">
<italic>27.7</italic>
</td>
<td char="." align="char">16,073</td>
<td char="." align="char">
<italic>49.0</italic>
</td>
<td char="." align="char">48</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">117</td>
<td char="." align="char">481</td>
<td char="." align="char">219</td>
<td char="." align="char">
<italic>88.1</italic>
</td>
<td char="." align="char">660</td>
<td char="." align="char">97,847</td>
<td char="." align="char">19,474</td>
<td char="." align="char">
<italic>19.9</italic>
</td>
<td char="." align="char">7,124</td>
<td char="." align="char">
<italic>63.4</italic>
</td>
<td char="." align="char">38</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>94.0</italic>
</td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>25.6</italic>
</td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>39.7</italic>
</td>
<td char="." align="char"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
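<p>To illustrate the kind of n-gram-based identification performed in this step, the following is a minimal sketch in the spirit of rank-order n-gram profiling (Cavnar and Trenkle style); it is our own illustration, not Apache Tika’s actual implementation.</p>
<preformat>
from collections import Counter

def ngram_profile(text, n=3, size=300):
    # character n-grams ranked by frequency form the language profile
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(profile, reference):
    # rank-order distance: sum of rank displacements against a reference profile
    ref_rank = {g: r for r, g in enumerate(reference)}
    penalty = len(reference)  # n-grams absent from the reference get a fixed penalty
    return sum(ref_rank.get(g, penalty) for g in profile)

def identify(text, references):
    # references: {language: profile built from training text for that language}
    profile = ngram_profile(text)
    return min(references, key=lambda lang: out_of_place(profile, references[lang]))
</preformat>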
<p>The comparison to the domain definition is based on the number of term occurrences, their locations in the web page (i.e., title, metadata, keywords, and body), and their weights. The page relevance score
<inline-formula id="IEq19">
<alternatives>
<tex-math id="M33">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p$$\end{document}</tex-math>
<mml:math id="M34">
<mml:mi>p</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq19.gif"></inline-graphic>
</alternatives>
</inline-formula>
is calculated as proposed by Ardö and Golub (
<xref ref-type="bibr" rid="CR1">2007</xref>
):
<disp-formula id="Equ2">
<alternatives>
<tex-math id="M35">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} p = \sum \limits _{i=1}^N \sum \limits _{j=1}^4 n_{ij}\cdot w_{i}^t\cdot w_{j}^l, \end{aligned}$$\end{document}</tex-math>
<mml:math id="M36" display="block">
<mml:mrow>
<mml:mtable columnspacing="0.5ex">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo movablelimits="false"></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:munderover>
<mml:mo movablelimits="false"></mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mn>4</mml:mn>
</mml:munderover>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
<mml:mo>·</mml:mo>
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mi>l</mml:mi>
</mml:msubsup>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="10579_2014_9282_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<inline-formula id="IEq20">
<alternatives>
<tex-math id="M37">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N$$\end{document}</tex-math>
<mml:math id="M38">
<mml:mi>N</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq20.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the number of terms in the domain definition,
<inline-formula id="IEq21">
<alternatives>
<tex-math id="M39">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_{i}^t$$\end{document}</tex-math>
<mml:math id="M40">
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mi>t</mml:mi>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq21.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the weight of term
<inline-formula id="IEq22">
<alternatives>
<tex-math id="M41">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document}</tex-math>
<mml:math id="M42">
<mml:mi>i</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq22.gif"></inline-graphic>
</alternatives>
</inline-formula>
,
<inline-formula id="IEq23">
<alternatives>
<tex-math id="M43">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_{j}^l$$\end{document}</tex-math>
<mml:math id="M44">
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mi>l</mml:mi>
</mml:msubsup>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq23.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the weight of location
<inline-formula id="IEq24">
<alternatives>
<tex-math id="M45">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j$$\end{document}</tex-math>
<mml:math id="M46">
<mml:mi>j</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq24.gif"></inline-graphic>
</alternatives>
</inline-formula>
, and
<inline-formula id="IEq25">
<alternatives>
<tex-math id="M47">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_{ij}$$\end{document}</tex-math>
<mml:math id="M48">
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq25.gif"></inline-graphic>
</alternatives>
</inline-formula>
denotes the number of occurrences of term
<inline-formula id="IEq26">
<alternatives>
<tex-math id="M49">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document}</tex-math>
<mml:math id="M50">
<mml:mi>i</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq26.gif"></inline-graphic>
</alternatives>
</inline-formula>
in location
<inline-formula id="IEq27">
<alternatives>
<tex-math id="M51">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$j$$\end{document}</tex-math>
<mml:math id="M52">
<mml:mi>j</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq27.gif"></inline-graphic>
</alternatives>
</inline-formula>
. The four discrete locations in a web page are
<italic>title</italic>
,
<italic>metadata</italic>
,
<italic>keywords</italic>
, and
<italic>html body</italic>
, with respective weights of 10, 4, 2, and 1 as proposed by Ardö and Golub (
<xref ref-type="bibr" rid="CR1">2007</xref>
). If
<inline-formula id="IEq28">
<alternatives>
<tex-math id="M53">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p$$\end{document}</tex-math>
<mml:math id="M54">
<mml:mi>p</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq28.gif"></inline-graphic>
</alternatives>
</inline-formula>
is higher than a predefined threshold, the web page is classified as relevant to the domain and stored. The threshold is the minimum number of terms to be found (the default value is 3) multiplied by the median value of the weights of all terms in the domain definition. It is worth mentioning that the user can favour precision over recall by raising the required number of terms in the crawler’s configuration file. Similarly, the page relevance score for each subdomain is calculated, and if this score is higher than the threshold, the web page is also classified as relevant to the corresponding subdomain(s). Otherwise the document is assigned to the “unknown” subdomain.</p>
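<p>To make the scoring concrete, the following is a minimal sketch of the relevance test under the definitions above; the data structures and function names are our own illustration, not the crawler’s actual code.</p>
<preformat>
from statistics import median

# location weights of Ardö and Golub (2007): title, metadata, keywords, html body
LOCATION_WEIGHTS = {"title": 10, "metadata": 4, "keywords": 2, "body": 1}

def page_relevance(term_counts, term_weights):
    # p = sum_i sum_j n_ij * w_i^t * w_j^l
    # term_counts: {term: {location: occurrences}}; term_weights: {term: weight}
    return sum(count * term_weights[term] * LOCATION_WEIGHTS[loc]
               for term, locations in term_counts.items()
               for loc, count in locations.items())

def is_relevant(term_counts, term_weights, min_terms=3):
    # threshold = required number of terms (default 3) times the median term weight
    threshold = min_terms * median(term_weights.values())
    return page_relevance(term_counts, term_weights) > threshold
</preformat>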
<p>Even when a page is not classified as relevant, it is still parsed and its links are extracted and added to the list of links to be visited. The fact that we keep links from non-relevant pages allows us to exploit the Tunnelling strategy (Bergmark et al.
<xref ref-type="bibr" rid="CR9">2002</xref>
), according to which the crawler does not give up examining a path when it encounters an irrelevant page. Instead, it continues searching that path for a predefined number of steps (the default value is 4), which allows the crawler to travel from one relevant web cluster to another when the number of irrelevant pages between them is beneath some threshold.</p>
<p>Although it is important to prevent the crawler from being ‘choked’, it is critical for crawl evolution that the crawler first follows links pointing to relevant pages. Therefore, we also adopted the Best-First algorithm in our implementation, since this strategy is considered the baseline in almost all related work. To this end, a link relevance score
<inline-formula id="IEq29">
<alternatives>
<tex-math id="M55">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$l$$\end{document}</tex-math>
<mml:math id="M56">
<mml:mi>l</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq29.gif"></inline-graphic>
</alternatives>
</inline-formula>
influenced by the source web page relevance score
<inline-formula id="IEq30">
<alternatives>
<tex-math id="M57">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p$$\end{document}</tex-math>
<mml:math id="M58">
<mml:mi>p</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq30.gif"></inline-graphic>
</alternatives>
</inline-formula>
and the estimated relevance of the link’s surrounding text is calculated as
<disp-formula id="Equ3">
<alternatives>
<tex-math id="M59">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\begin{aligned} l = \frac{p}{N} + \sum _{i=1}^M n_{i}\cdot w_{i}, \end{aligned}$$\end{document}</tex-math>
<mml:math id="M60" display="block">
<mml:mrow>
<mml:mtable columnspacing="0.5ex">
<mml:mtr>
<mml:mtd columnalign="right">
<mml:mrow>
<mml:mi>l</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mi>p</mml:mi>
<mml:mi>N</mml:mi>
</mml:mfrac>
<mml:mo>+</mml:mo>
<mml:munderover>
<mml:mo></mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:munderover>
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>·</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
<graphic xlink:href="10579_2014_9282_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where
<inline-formula id="IEq31">
<alternatives>
<tex-math id="M61">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N$$\end{document}</tex-math>
<mml:math id="M62">
<mml:mi>N</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq31.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the number of links originating from the source page,
<inline-formula id="IEq32">
<alternatives>
<tex-math id="M63">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$M$$\end{document}</tex-math>
<mml:math id="M64">
<mml:mi>M</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq32.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the number of terms in the domain definition,
<inline-formula id="IEq33">
<alternatives>
<tex-math id="M65">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_{i}$$\end{document}</tex-math>
<mml:math id="M66">
<mml:msub>
<mml:mi>n</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq33.gif"></inline-graphic>
</alternatives>
</inline-formula>
denotes the number of occurrences of the
<inline-formula id="IEq34">
<alternatives>
<tex-math id="M67">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document}</tex-math>
<mml:math id="M68">
<mml:mi>i</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq34.gif"></inline-graphic>
</alternatives>
</inline-formula>
-th term in the surrounding text and
<inline-formula id="IEq35">
<alternatives>
<tex-math id="M69">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_{i}$$\end{document}</tex-math>
<mml:math id="M70">
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq35.gif"></inline-graphic>
</alternatives>
</inline-formula>
is the weight of the
<inline-formula id="IEq36">
<alternatives>
<tex-math id="M71">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$i$$\end{document}</tex-math>
<mml:math id="M72">
<mml:mi>i</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq36.gif"></inline-graphic>
</alternatives>
</inline-formula>
-th term. This formulation of the link score was inspired by the conclusion of Cho et al. (
<xref ref-type="bibr" rid="CR17">1998</xref>
), who stated that using a similarity metric that considers the content of anchors tends to produce some amount of differentiation between out-links and forces the crawler to visit relevant web pages earlier. New and unvisited links are merged and then sorted by their scores so that the most promising links are selected first for the next cycle. The statistics from the acquisition procedure are provided in Table 
<xref rid="Tab2" ref-type="table">2</xref>
.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Visualisation of temporal precision (ratio of stored/visited pages per cycle) during three crawls</p>
</caption>
<graphic xlink:href="10579_2014_9282_Fig1_HTML" id="MO4"></graphic>
</fig>
</p>
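<p>The following sketch shows the link scoring and best-first selection just described; counting term occurrences in the surrounding text is simplified to whole-word matching, and all names are illustrative.</p>
<preformat>
import heapq
import re

def link_score(page_score, n_outlinks, surrounding_text, term_weights):
    # l = p/N + sum_i n_i * w_i
    score = page_score / n_outlinks
    text = surrounding_text.lower()
    for term, weight in term_weights.items():
        occurrences = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        score += occurrences * weight
    return score

# best-first frontier: a max-heap keyed on the link score, so the most
# promising unvisited links are selected first for the next cycle
frontier = []

def push(url, score):
    heapq.heappush(frontier, (-score, url))

def pop_best():
    return heapq.heappop(frontier)[1]
</preformat>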
<p>In order to estimate the crawler’s accuracy in acquiring in-domain resources, we first ran trial crawls in English, French, and Greek for the
<italic>env</italic>
and
<italic>lab</italic>
domains and asked native speakers to classify a sample of the acquired documents as domain-relevant or not, based on the provided domain descriptions (see
<xref rid="App1" ref-type="app">Appendix</xref>
). The results of the trial phase are given in columns 3–6 of Table 
<xref rid="Tab2" ref-type="table">2</xref>
. The average accuracy over all data sets is 94.0 % (see column 6).</p>
<p>Then we repeated the crawls to acquire larger collections (see columns 7–13). Duplicate web pages were detected and removed based on MD5 hashes, and near-duplicates were eliminated by employing the deduplication strategy implemented in the Nutch framework, which involves construction of a text profile based on quantised word frequencies.</p>
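<p>As an illustration of the two deduplication stages, a hedged sketch follows; the quantisation parameter is an assumption for illustration, not Nutch’s actual default.</p>
<preformat>
import hashlib
from collections import Counter

def md5_signature(text):
    # exact duplicates: identical texts share an identical MD5 hash
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def profile_signature(text, quant=2):
    # near-duplicates: hash a profile of quantised word frequencies, so that
    # small textual differences collapse to the same signature
    counts = Counter(text.lower().split())
    profile = sorted((tok, freq // quant)
                     for tok, freq in counts.items() if freq // quant > 0)
    return md5_signature(" ".join("%s:%d" % (t, q) for t, q in profile))
</preformat>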
<p>As shown in column 10 of Table 
<xref rid="Tab2" ref-type="table">2</xref>
, the average precision at the end of the crawl jobs is about 25 %, a result similar to the conclusions reached by Srinivasan et al. (
<xref ref-type="bibr" rid="CR63">2005</xref>
) and Dorado (
<xref ref-type="bibr" rid="CR20">2008</xref>
). Figure
<xref rid="Fig1" ref-type="fig">1</xref>
further illustrates the variation of the crawler’s temporal precision (i.e., the ratio of stored over visited pages after each crawling cycle) during the evolution of three crawls, where the average temporal precision remains above 20 % after 400 crawling cycles (the default value of the maximum number of URLs to be visited per cycle is 256).</p>
<p>The
<inline-formula id="IEq37">
<alternatives>
<tex-math id="M73">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta \hbox {s}$$\end{document}</tex-math>
<mml:math id="M74">
<mml:mrow>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mtext>s</mml:mtext>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq37.gif"></inline-graphic>
</alternatives>
</inline-formula>
in the 12th column of Table 
<xref rid="Tab2" ref-type="table">2</xref>
refer to the percentage of documents removed during deduplication. These relatively high percentages are in accordance with the observation of Baroni et al. (
<xref ref-type="bibr" rid="CR8">2009</xref>
), where during compilation of the Wacky corpora the number of documents was reduced by more than 50 % following deduplication. Another observation is that the percentages of duplicates for the
<italic>lab</italic>
domain are much higher than the ones for
<italic>env</italic>
for all languages. This is explained by the fact that the web pages related to
<italic>lab</italic>
are often legal documents or press releases replicated on many websites.</p>
<p>The final processing of the monolingual data was performed on paragraphs marked by Boilerpipe and the language identifier. The statistics from this phase are presented in Table 
<xref rid="Tab3" ref-type="table">3</xref>
. Firstly, we discarded all paragraphs in languages other than the targeted ones, as well as those classified as boilerplate, which reduced the total number of paragraphs to 23.3 % of the original on average. Removal of duplicate paragraphs then reduced the total to 14 % on average. Most of the removed paragraphs, however, were very short chunks of text; in terms of tokens, the reduction is only 50.6 %. The last three columns in Table 
<xref rid="Tab3" ref-type="table">3</xref>
refer to the final monolingual data sets used for training language models. For English and French, we acquired about 45 million tokens for each domain; for Greek, which is less frequent on the web, we obtained only about 15 and 20 million tokens for
<italic>lab</italic>
and
<italic>env</italic>
, respectively. These datasets are available from the ELRA catalogue
<xref ref-type="fn" rid="Fn13">13</xref>
under reference numbers ELRA-W00063–ELRA-W00068.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Statistics from the cleaning stage of monolingual data acquisition and of the final data set:
<italic>paragraphs cleaned</italic>
refers to the paragraphs classified as non-boilerplate, and
<italic>paragraphs unique</italic>
to those obtained after duplicate removal</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Language</th>
<th align="left">Domain</th>
<th align="left">Paragraphs all</th>
<th align="left">Paragraphs cleaned</th>
<th align="left">(
<inline-formula id="IEq38">
<alternatives>
<tex-math id="M75">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta \%$$\end{document}</tex-math>
<mml:math id="M76">
<mml:mrow>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mo>%</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq38.gif"></inline-graphic>
</alternatives>
</inline-formula>
)</th>
<th align="left">Paragraphs unique</th>
<th align="left">(
<inline-formula id="IEq39">
<alternatives>
<tex-math id="M77">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta \%$$\end{document}</tex-math>
<mml:math id="M78">
<mml:mrow>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mo>%</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq39.gif"></inline-graphic>
</alternatives>
</inline-formula>
)</th>
<th align="left">Sentences</th>
<th align="left">Tokens</th>
<th align="left">Vocabulary</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="2">English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">5,841,059</td>
<td char="." align="char">1,088,660</td>
<td char="." align="char">
<italic>18.6</italic>
</td>
<td char="." align="char">693,971</td>
<td char="." align="char">
<italic>11.9</italic>
</td>
<td char="." align="char">1,700,436</td>
<td char="." align="char">44,853,229</td>
<td char="." align="char">225,650</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">3,447,451</td>
<td char="." align="char">896,369</td>
<td char="." align="char">
<italic>26.0</italic>
</td>
<td char="." align="char">609,696</td>
<td char="." align="char">
<italic>17.7</italic>
</td>
<td char="." align="char">1,407,448</td>
<td char="." align="char">43,726,781</td>
<td char="." align="char">136,678</td>
</tr>
<tr>
<td align="left" rowspan="2">French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">4,440,033</td>
<td char="." align="char">1,069,889</td>
<td char="." align="char">
<italic>24.1</italic>
</td>
<td char="." align="char">666,553</td>
<td char="." align="char">
<italic>15.0</italic>
</td>
<td char="." align="char">1,235,107</td>
<td char="." align="char">42,780,009</td>
<td char="." align="char">246,177</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">5,623,427</td>
<td char="." align="char">1,382,420</td>
<td char="." align="char">
<italic>24.6</italic>
</td>
<td char="." align="char">822,201</td>
<td char="." align="char">
<italic>14.6</italic>
</td>
<td char="." align="char">1,232,707</td>
<td char="." align="char">46,992,912</td>
<td char="." align="char">180,628</td>
</tr>
<tr>
<td align="left" rowspan="2">Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">3,023,295</td>
<td char="." align="char">672,763</td>
<td char="." align="char">
<italic>22.3</italic>
</td>
<td char="." align="char">352,017</td>
<td char="." align="char">
<italic>11.6</italic>
</td>
<td char="." align="char">655,353</td>
<td char="." align="char">20,253,160</td>
<td char="." align="char">324,544</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">2,176,571</td>
<td char="." align="char">521,109</td>
<td char="." align="char">
<italic>23.9</italic>
</td>
<td char="." align="char">284,872</td>
<td char="." align="char">
<italic>13.1</italic>
</td>
<td char="." align="char">521,358</td>
<td char="." align="char">15,583,737</td>
<td char="." align="char">273,602</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>23.3</italic>
</td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>14.0</italic>
</td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Statistics about the distribution of the subdomains of
<italic>env</italic>
and
<italic>lab</italic>
in English are presented in Table 
<xref rid="Tab4" ref-type="table">4</xref>
. The distributions for the Greek and French collections are similar, so we do not present them here. The main observation is that the collections are biased towards specific subdomains. For example, “labour market” and “labour law and labour relations” cover 28.62 % and 25.68 % of the English
<italic>lab</italic>
data, respectively. This is due to the popularity of these subdomains in comparison with the rest, as well as the fact that the crawler’s goal was to acquire in-domain web pages without any requirement to build corpora balanced equally across subdomains. Another observation is that many documents were assigned to two subdomains. For example, 38.09 % of the documents in the English
<italic>env</italic>
collection were categorised in both “deterioration of the environment” and “natural environment”. This is explained by the fact that many terms of the domain definition were assigned to more than one subdomain. In addition, many crawled pages contain data relevant to these neighbouring subdomains.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>Distribution of subdomains in the monolingual English data crawled for the
<italic>env</italic>
and
<italic>lab</italic>
domains</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">Ratio (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">
<italic>Environment</italic>
</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">Deterioration of the environment; natural environment</td>
<td char="." align="char">38.09</td>
</tr>
<tr>
<td align="left">Natural environment</td>
<td char="." align="char">35.63</td>
</tr>
<tr>
<td align="left">Environmental policy; natural environment</td>
<td char="." align="char">8.60</td>
</tr>
<tr>
<td align="left">Energy policy</td>
<td char="." align="char">4.10</td>
</tr>
<tr>
<td align="left">Deterioration of the environment; environmental policy; natural environment</td>
<td char="." align="char">3.34</td>
</tr>
<tr>
<td align="left">Deterioration of the environment</td>
<td char="." align="char">2.64</td>
</tr>
<tr>
<td align="left">Environmental policy</td>
<td char="." align="char">2.61</td>
</tr>
<tr>
<td align="left">Cultivation of agricultural land</td>
<td char="." align="char">2.28</td>
</tr>
<tr>
<td align="left">Deterioration of the environment; environmental policy</td>
<td char="." align="char">2.16</td>
</tr>
<tr>
<td align="left">
<italic>Unknown</italic>
</td>
<td char="." align="char">0.56</td>
</tr>
<tr>
<td align="left">Total</td>
<td char="." align="char">100.00</td>
</tr>
<tr>
<td align="left" colspan="2">
<italic>Labour legislation</italic>
</td>
</tr>
<tr>
<td align="left">Labour market</td>
<td char="." align="char">28.62</td>
</tr>
<tr>
<td align="left">Labour law and labour relations</td>
<td char="." align="char">25.68</td>
</tr>
<tr>
<td align="left">Organisation of work and working conditions</td>
<td char="." align="char">12.46</td>
</tr>
<tr>
<td align="left">Labour market; organisation of work and working conditions</td>
<td char="." align="char">6.76</td>
</tr>
<tr>
<td align="left">Labour law and labour relations; labour market</td>
<td char="." align="char">5.46</td>
</tr>
<tr>
<td align="left">Employment</td>
<td char="." align="char">4.26</td>
</tr>
<tr>
<td align="left">Employment; labour market</td>
<td char="." align="char">3.59</td>
</tr>
<tr>
<td align="left">Labour law and labour relations; organisation of work and working conditions</td>
<td char="." align="char">3.40</td>
</tr>
<tr>
<td align="left">Personnel management and staff remuneration</td>
<td char="." align="char">3.05</td>
</tr>
<tr>
<td align="left">Labour market; personnel management and staff remuneration</td>
<td char="." align="char">2.76</td>
</tr>
<tr>
<td align="left">
<italic>Unknown</italic>
</td>
<td char="." align="char">2.07</td>
</tr>
<tr>
<td align="left">Labour law and labour relations; labour market; organisation of work and working conditions</td>
<td char="." align="char">1.90</td>
</tr>
<tr>
<td align="left">Total</td>
<td char="." align="char">100.00</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
<sec id="Sec9">
<title>Acquisition of parallel texts</title>
<p>We now describe the procedure for acquisition of parallel data. To this end, we used the focused crawler's bilingual mode of operation (FBC), which is also available as a web service.
<xref ref-type="fn" rid="Fn14">14</xref>
Apart from the components for monolingual data acquisition (normalisation, language identification, cleaning, text classification and deduplication), this mode integrates a component for detection of parallel web pages, as illustrated in Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
.
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>The entire workflow of parallel data acquisition resulting in training, development, and test sets</p>
</caption>
<graphic xlink:href="10579_2014_9282_Fig2_HTML" id="MO5"></graphic>
</fig>
</p>
<p>To guide FBC we used bilingual domain definitions, which consisted of the union of monolingual domain definitions in the targeted languages for the selected domain. In order to construct the list of seed URLs, we manually selected web pages that were collected during the monolingual crawls and originated from in-domain multilingual web sites. We then initialised the crawler with these URLs and restricted it to following only links internal to these sites. Adopting the same crawling strategy as in the previous subsection, FBC follows the most promising links and continues crawling a web site until no more internal links can be extracted.</p>
<p>After downloading in-domain pages from the selected web sites, we employed Bitextor to identify pairs of pages that could be considered translations of each other. Specifically, for each candidate pair of pages, we examine the relative difference in file size, the relative difference in length of plain text, the edit distance of web page fingerprints constructed on the basis of HTML tags, and the edit distance of the lists of numbers in the documents. If all measures are below the corresponding thresholds, as defined in the default configuration of Bitextor, the two pages are considered parallel. The amount of acquired in-domain bilingual data is reported in columns 3 and 4 of Table 
<xref rid="Tab5" ref-type="table">5</xref>
.</p>
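<p>A hedged sketch of this test is given below; the field names and thresholds are placeholders rather than Bitextor’s defaults, and the normalised SequenceMatcher distance merely stands in for a proper edit distance.</p>
<preformat>
from difflib import SequenceMatcher

def rel_diff(a, b):
    # relative difference between two sizes or lengths
    return abs(a - b) / max(a, b, 1)

def edit_dissimilarity(a, b):
    # 1 - similarity ratio, used here as a normalised stand-in for edit distance
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def looks_parallel(p1, p2, t_size=0.2, t_len=0.2, t_tags=0.2, t_nums=0.2):
    # p1/p2 are assumed to carry file_size, plain_text, tag_fingerprint and
    # numbers (the list of numbers found in the document)
    return (rel_diff(p1.file_size, p2.file_size) < t_size
            and rel_diff(len(p1.plain_text), len(p2.plain_text)) < t_len
            and edit_dissimilarity(p1.tag_fingerprint, p2.tag_fingerprint) < t_tags
            and edit_dissimilarity(" ".join(p1.numbers), " ".join(p2.numbers)) < t_nums)
</preformat>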
</sec>
<sec id="Sec10">
<title>Extraction of parallel sentences</title>
<p>After identification of parallel pages, the next steps of the procedure aim at extraction of parallel sentences, i.e., sentence pairs that are likely to be mutual translations. For each document pair free of boilerplate paragraphs, we applied the following steps: identification of sentence boundaries by the Europarl sentence splitter, tokenisation by the Europarl tokeniser (Koehn
<xref ref-type="bibr" rid="CR36">2005</xref>
), and sentence alignment by Hunalign (Varga et al.
<xref ref-type="bibr" rid="CR67">2005</xref>
). Hunalign implements a heuristic, language-independent method for identification of parallel sentences in parallel texts, which can be improved by providing an external bilingual dictionary of word forms. If no such dictionary is provided, Hunalign builds it automatically from the data to be aligned. Without having such (external) dictionaries for EN–FR and EN–EL at hand, we obtained them by applying Hunalign to realign Europarl data in these languages. The resulting dictionaries were subsequently used to improve sentence alignment of our in-domain data.</p>
<p>For each sentence pair identified as parallel, Hunalign provides a confidence score, which reflects the level of parallelism, i.e., the degree to which the sentences are mutual translations. Relying on the judgement of native speakers, we manually investigated a sample of about 50 sentence pairs extracted by Hunalign from the data pool for each language pair and domain, and estimated that sentence pairs with a score above 0.4 are of sufficient translation quality. In the next step, we kept only sentence pairs with 1:1 alignment (one sentence on each side) and removed those with scores below this threshold. Finally, we also removed duplicate sentence pairs.</p>
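<p>The following minimal sketch summarises this filtering step; the tuple layout of Hunalign’s output is an assumption for illustration.</p>
<preformat>
def filter_pairs(alignments, threshold=0.4):
    # alignments: iterable of (source_sentences, target_sentences, score)
    seen, kept = set(), []
    for src, tgt, score in alignments:
        if len(src) != 1 or len(tgt) != 1:
            continue  # keep 1:1 alignments only
        if score < threshold:
            continue  # drop pairs below the estimated quality threshold
        pair = (src[0], tgt[0])
        if pair in seen:
            continue  # drop duplicate sentence pairs
        seen.add(pair)
        kept.append(pair)
    return kept
</preformat>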
<p>The statistics from the parallel data acquisition procedure are displayed in Table 
<xref rid="Tab5" ref-type="table">5</xref>
. An average of 84 % of the source sentences extracted from the parallel documents were aligned 1:1; 10 % of these were then removed due to low estimated translation quality, and after discarding duplicate sentence pairs we ended up with 73 % of the original source sentences aligned to their target sides.</p>
</sec>
<sec id="Sec11">
<title>Manual correction of test sentence pairs</title>
<p>The translation quality of a PB-SMT system built using the parallel sentences obtained by the procedure described above might not be optimal. Tuning the procedure and focusing on high-quality translations is possible, but leads to a trade-off between quality and quantity. For translation model training, high translation quality of the data is less essential than for tuning and testing. Bad phrase pairs can be removed from the SMT translation tables using, for example, significance testing (Johnson et al.
<xref ref-type="bibr" rid="CR32">2007</xref>
). However, a development set containing sentence pairs that are not good translations of each other might lead to sub-optimal values of model weights, which would significantly harm system performance. If such sentences are used in the test set, the evaluation would be unreliable.</p>
<p>In order to create reliable test and development sets for each language pair and domain, we performed the following low-cost procedure. From the data obtained by the steps described in Sect. 
<xref rid="Sec10" ref-type="sec">3.3</xref>
, we selected a random sample of 3,600 sentence pairs (2,700 for EN–EL in the
<italic>lab</italic>
domain, for which less data was available) and asked native speakers to check and correct them. All four evaluators (two for each language) were researchers with postgraduate education and significant experience in evaluation for NLP tasks. The task consisted of checking that the sentence pairs belonged to the right domain, that the sentences within a sentence pair were equivalent in terms of content, and that the translation quality was adequate, and of correcting the pairs where it was not.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Statistics from the parallel data acquisition procedure: websites from which the data was crawled (
<italic>sites</italic>
), total document pairs (
<italic>documents</italic>
), source-side sentences (
<italic>sentences all</italic>
), aligned sentence pairs (
<italic>paired</italic>
), those of sufficient translation quality (
<italic>good</italic>
); after duplicate removal (
<italic>unique</italic>
); sentences randomly selected for manual correction (
<italic>sampled</italic>
) and those manually validated and (if necessary) corrected (
<italic>corrected</italic>
); details in Table 
<xref rid="Tab6" ref-type="table">6</xref>
</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Language pair</th>
<th align="left">Domain</th>
<th align="left">Sites</th>
<th align="left">Documents</th>
<th align="left">Sentences all</th>
<th align="left">Sentences paired</th>
<th align="left">(
<inline-formula id="IEq40">
<alternatives>
<tex-math id="M79">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta \%$$\end{document}</tex-math>
<mml:math id="M80">
<mml:mrow>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mo>%</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq40.gif"></inline-graphic>
</alternatives>
</inline-formula>
)</th>
<th align="left">Sentences good</th>
<th align="left">(
<inline-formula id="IEq41">
<alternatives>
<tex-math id="M81">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta \%$$\end{document}</tex-math>
<mml:math id="M82">
<mml:mrow>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mo>%</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq41.gif"></inline-graphic>
</alternatives>
</inline-formula>
)</th>
<th align="left">Sentences unique</th>
<th align="left">(
<inline-formula id="IEq42">
<alternatives>
<tex-math id="M83">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta \%$$\end{document}</tex-math>
<mml:math id="M84">
<mml:mrow>
<mml:mi mathvariant="italic">Δ</mml:mi>
<mml:mo>%</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq42.gif"></inline-graphic>
</alternatives>
</inline-formula>
)</th>
<th align="left">Sentences sampled</th>
<th align="left">Sentences corrected</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="2">English–French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">6</td>
<td char="." align="char">559</td>
<td char="." align="char">19,042</td>
<td char="." align="char">14,881</td>
<td char="." align="char">
<italic>78.1</italic>
</td>
<td char="." align="char">14,079</td>
<td char="." align="char">
<italic>73.9</italic>
</td>
<td char="." align="char">13,840</td>
<td char="." align="char">
<italic>72.7</italic>
</td>
<td char="." align="char">3,600</td>
<td char="." align="char">3,392</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">4</td>
<td char="." align="char">900</td>
<td char="." align="char">35,870</td>
<td char="." align="char">31,541</td>
<td char="." align="char">
<italic>87.9</italic>
</td>
<td char="." align="char">27,601</td>
<td char="." align="char">
<italic>76.9</italic>
</td>
<td char="." align="char">23,861</td>
<td char="." align="char">
<italic>66.5</italic>
</td>
<td char="." align="char">3,600</td>
<td char="." align="char">3,411</td>
</tr>
<tr>
<td align="left" rowspan="2">English–Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">14</td>
<td char="." align="char">288</td>
<td char="." align="char">17,033</td>
<td char="." align="char">14,846</td>
<td char="." align="char">
<italic>87.2</italic>
</td>
<td char="." align="char">14,028</td>
<td char="." align="char">
<italic>82.4</italic>
</td>
<td char="." align="char">13,253</td>
<td char="." align="char">
<italic>77.8</italic>
</td>
<td char="." align="char">3,600</td>
<td char="." align="char">3,000</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">7</td>
<td char="." align="char">203</td>
<td char="." align="char">13,169</td>
<td char="." align="char">11,006</td>
<td char="." align="char">
<italic>83.6</italic>
</td>
<td char="." align="char">9,904</td>
<td char="." align="char">
<italic>75.2</italic>
</td>
<td char="." align="char">9,764</td>
<td char="." align="char">
<italic>74.1</italic>
</td>
<td char="." align="char">2,700</td>
<td char="." align="char">2,506</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>84.2</italic>
</td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>77.1</italic>
</td>
<td char="." align="char"></td>
<td char="." align="char">
<italic>72.8</italic>
</td>
<td char="." align="char"></td>
<td char="." align="char"></td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="Tab6">
<label>Table 6</label>
<caption>
<p>Statistics (%) of manual correction of a sample of parallel sentences extracted by Hunalign</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">EN–EL/
<italic>env</italic>
</th>
<th align="left">EN–FR/
<italic>lab</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1. Perfect translation</td>
<td char="." align="char">53.49</td>
<td char="." align="char">72.23</td>
</tr>
<tr>
<td align="left">2. Minor corrections done</td>
<td char="." align="char">34.15</td>
<td char="." align="char">21.99</td>
</tr>
<tr>
<td align="left">3. Major corrections needed</td>
<td char="." align="char">3.00</td>
<td char="." align="char">0.33</td>
</tr>
<tr>
<td align="left">4. Misaligned sentence pair</td>
<td char="." align="char">5.09</td>
<td char="." align="char">1.58</td>
</tr>
<tr>
<td align="left">5. Wrong domain</td>
<td char="." align="char">4.28</td>
<td char="." align="char">3.86</td>
</tr>
<tr>
<td align="left">Total</td>
<td char="." align="char">100.00</td>
<td char="." align="char">100.00</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Our goal was to obtain at least 3,000 correct sentence pairs (2,000 test pairs and 1,000 development pairs) for each domain and language pair. Accordingly, in order to speed up the process, we did not instruct the correctors to amend every sentence pair, but rather allowed them to skip (remove) any sentence pairs that were misaligned. In addition, we asked them to remove those sentence pairs that were obviously from a very different domain (despite being correct translations). The number of manually verified and (if necessary) corrected sentence pairs is presented in the last column in Table 
<xref rid="Tab5" ref-type="table">5</xref>
.</p>
<p>According to the human judgements, 53–72 % of sentence pairs were accurate translations, 22–34 % needed only minor corrections, 1–3 % would require major corrections (which was not necessary, as the accurate sentence pairs together with those requiring minor corrections were enough to reach our goal of at least 3,000 sentence pairs), 2–5 % of sentence pairs were misaligned and would have had to be translated completely (which was not necessary in most cases), and about 4 % of sentence pairs were from a different domain (though correct translations). Detailed statistics collected during the corrections are presented in Table 
<xref rid="Tab6" ref-type="table">6</xref>
.</p>
<p>In the next step, we selected 2,000 pairs from the corrected sentences for the test set and left the remaining part for the development set. Those parallel sentences which were not sampled for the correction phase were added to the training sets. The correctors confirmed that the manual corrections were about 5–10 times faster than translating the sentences from scratch, so this can be viewed as a low-cost method for acquiring in-domain test and development sets for MT. Further statistics of all parallel data sets are given in Table 
<xref rid="Tab7" ref-type="table">7</xref>
. The data sets are available from ELRA under reference numbers ELRA-W0057 and ELRA-W0058.
<table-wrap id="Tab7">
<label>Table 7</label>
<caption>
<p>Statistics of the domain-specific parallel data sets obtained by web crawling and manual correction</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Language pair (L1–L2)</th>
<th align="left">Domain</th>
<th align="left">Set</th>
<th align="left">Corrected</th>
<th align="left">Sentence pairs</th>
<th align="left">L1 tokens</th>
<th align="left">L1 vocabulary</th>
<th align="left">L2 tokens</th>
<th align="left">L2 vocabulary</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="8">English–French</td>
<td align="left" rowspan="4">
<italic>env</italic>
</td>
<td align="left">Train</td>
<td align="left">No</td>
<td char="." align="char">10,240</td>
<td char="." align="char">300,760</td>
<td char="." align="char">10,963</td>
<td char="." align="char">362,899</td>
<td char="." align="char">14,209</td>
</tr>
<tr>
<td align="left">Dev</td>
<td align="left">Yes</td>
<td char="." align="char">1,392</td>
<td char="." align="char">41,382</td>
<td char="." align="char">4,660</td>
<td char="." align="char">49,657</td>
<td char="." align="char">5,542</td>
</tr>
<tr>
<td align="left">Dev
<sup>
<italic>raw</italic>
</sup>
</td>
<td align="left">No</td>
<td char="." align="char">1,458</td>
<td char="." align="char">42,414</td>
<td char="." align="char">4,754</td>
<td char="." align="char">50,965</td>
<td char="." align="char">5,700</td>
</tr>
<tr>
<td align="left">Test</td>
<td align="left">Yes</td>
<td char="." align="char">2,000</td>
<td char="." align="char">58,865</td>
<td char="." align="char">5,483</td>
<td char="." align="char">70,740</td>
<td char="." align="char">6,617</td>
</tr>
<tr>
<td align="left" rowspan="4">
<italic>lab</italic>
</td>
<td align="left">Train</td>
<td align="left">No</td>
<td char="." align="char">20,261</td>
<td char="." align="char">709,893</td>
<td char="." align="char">12,746</td>
<td char="." align="char">836,634</td>
<td char="." align="char">17,139</td>
</tr>
<tr>
<td align="left">Dev</td>
<td align="left">Yes</td>
<td char="." align="char">1,411</td>
<td char="." align="char">52,156</td>
<td char="." align="char">4,478</td>
<td char="." align="char">61,191</td>
<td char="." align="char">5,535</td>
</tr>
<tr>
<td align="left">Dev
<sup>
<italic>raw</italic>
</sup>
</td>
<td align="left">No</td>
<td char="." align="char">1,498</td>
<td char="." align="char">54,024</td>
<td char="." align="char">4,706</td>
<td char="." align="char">63,519</td>
<td char="." align="char">5,832</td>
</tr>
<tr>
<td align="left">Test</td>
<td align="left">Yes</td>
<td char="." align="char">2,000</td>
<td char="." align="char">71,688</td>
<td char="." align="char">5,277</td>
<td char="." align="char">84,397</td>
<td char="." align="char">6,630</td>
</tr>
<tr>
<td align="left" rowspan="8">English–Greek</td>
<td align="left" rowspan="4">
<italic>env</italic>
</td>
<td align="left">Train</td>
<td align="left">No</td>
<td char="." align="char">9,653</td>
<td char="." align="char">240,822</td>
<td char="." align="char">10,932</td>
<td char="." align="char">267,742</td>
<td char="." align="char">20,185</td>
</tr>
<tr>
<td align="left">Dev</td>
<td align="left">Yes</td>
<td char="." align="char">1,000</td>
<td char="." align="char">27,865</td>
<td char="." align="char">3,586</td>
<td char="." align="char">30,510</td>
<td char="." align="char">5,467</td>
</tr>
<tr>
<td align="left">Dev
<sup>
<italic>raw</italic>
</sup>
</td>
<td align="left">No</td>
<td char="." align="char">1,134</td>
<td char="." align="char">32,588</td>
<td char="." align="char">3,967</td>
<td char="." align="char">35,446</td>
<td char="." align="char">6,137</td>
</tr>
<tr>
<td align="left">Test</td>
<td align="left">Yes</td>
<td char="." align="char">2,000</td>
<td char="." align="char">58,073</td>
<td char="." align="char">4,893</td>
<td char="." align="char">63,551</td>
<td char="." align="char">8,229</td>
</tr>
<tr>
<td align="left" rowspan="4">
<italic>lab</italic>
</td>
<td align="left">Train</td>
<td align="left">No</td>
<td char="." align="char">7,064</td>
<td char="." align="char">233,145</td>
<td char="." align="char">7,136</td>
<td char="." align="char">244,396</td>
<td char="." align="char">14,456</td>
</tr>
<tr>
<td align="left">Dev</td>
<td align="left">Yes</td>
<td char="." align="char">506</td>
<td char="." align="char">15,129</td>
<td char="." align="char">2,227</td>
<td char="." align="char">16,089</td>
<td char="." align="char">3,333</td>
</tr>
<tr>
<td align="left">Dev
<sup>
<italic>raw</italic>
</sup>
</td>
<td align="left">No</td>
<td char="." align="char">547</td>
<td char="." align="char">17,027</td>
<td char="." align="char">2,386</td>
<td char="." align="char">18,172</td>
<td char="." align="char">3,620</td>
</tr>
<tr>
<td align="left">Test</td>
<td align="left">Yes</td>
<td char="." align="char">2,000</td>
<td char="." align="char">62,953</td>
<td char="." align="char">4,022</td>
<td char="." align="char">66,770</td>
<td char="." align="char">7,056</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
</sec>
<sec id="Sec12">
<title>Baseline translation system</title>
<p>We now present our experimental set-up, the baseline (general-domain) system and its performance. Our primary evaluation measure is BLEU (Papineni et al.
<xref ref-type="bibr" rid="CR50">2002</xref>
), always reported as percentages. For detailed analysis, we also present PER (Tillmann et al.
<xref ref-type="bibr" rid="CR65">1997</xref>
) and TER (Snover et al.
<xref ref-type="bibr" rid="CR61">2006</xref>
) in Tables 
<xref rid="Tab17" ref-type="table">17</xref>
<xref rid="Tab20" ref-type="table">20</xref>
. The latter two are error rates, so the lower the score the better. In this paper, however, we report the scores as
<inline-formula id="IEq47">
<alternatives>
<tex-math id="M85">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(1-\hbox {PER})\times 100$$\end{document}</tex-math>
<mml:math id="M86">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mtext>PER</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>×</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq47.gif"></inline-graphic>
</alternatives>
</inline-formula>
and
<inline-formula id="IEq48">
<alternatives>
<tex-math id="M87">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(1-\hbox {TER})\times 100$$\end{document}</tex-math>
<mml:math id="M88">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>-</mml:mo>
<mml:mtext>TER</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>×</mml:mo>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq48.gif"></inline-graphic>
</alternatives>
</inline-formula>
respectively, so that all metrics are in the range 0–100, where higher scores indicate better translations.</p>
<sec id="Sec13">
<title>System description</title>
<p>Our SMT system is MaTrEx (Penkale et al.
<xref ref-type="bibr" rid="CR54">2010</xref>
), a combination-based multi-engine architecture developed at Dublin City University. The architecture includes various individual systems: phrase-based, example-based, hierarchical phrase-based, and tree-based MT. In this work, we only exploit the phrase-based component, which is based on Moses (Koehn et al.
<xref ref-type="bibr" rid="CR39">2007</xref>
), an open-source toolkit for SMT.</p>
<p>For training, all data sets are tokenised and lowercased using the Europarl tools.
<xref ref-type="fn" rid="Fn15">15</xref>
The original (non-lowercased) target side of the parallel data is kept for training the Moses recaser. The lowercased versions of the target side are used for training an interpolated 5-gram language model with Kneser-Ney discounting (Kneser and Ney
<xref ref-type="bibr" rid="CR34">1995</xref>
) using the SRILM toolkit (Stolcke
<xref ref-type="bibr" rid="CR64">2002</xref>
). The parallel training data is lowercased and filtered at the sentence level; we kept all sentence pairs having fewer than 100 words on each side and with the length ratio within the interval
<inline-formula id="IEq49">
<alternatives>
<tex-math id="M89">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\langle 0.11,9.0\rangle $$\end{document}</tex-math>
<mml:math id="M90">
<mml:mrow>
<mml:mo stretchy="false"></mml:mo>
<mml:mn>0.11</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>9.0</mml:mn>
<mml:mo stretchy="false"></mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq49.gif"></inline-graphic>
</alternatives>
</inline-formula>
. The maximum length for aligned phrases is set to seven and the reordering models are generated using the parameters
<italic>distance</italic>
and
<italic>orientation-bidirectional-fe</italic>
. The resulting system combines the 14 feature functions described in Sect.
<xref rid="Sec5" ref-type="sec">2.3</xref>
.</p>
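<p>The sentence-level filter can be sketched as follows; this is an illustrative reimplementation of the thresholds stated above, not the exact script used in our pipeline.</p>
<preformat>
def keep_pair(src, tgt, max_len=100, min_ratio=0.11, max_ratio=9.0):
    """Return True if a tokenised sentence pair passes the length
    filters: fewer than max_len tokens on each side and a
    source/target length ratio within [min_ratio, max_ratio]."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if src_len >= max_len or tgt_len >= max_len:
        return False
    return min_ratio <= src_len / tgt_len <= max_ratio
</preformat>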
<p>The corresponding parameters are optimised on the development sets by MERT. After running several experiments with MERT, we found that the variance of BLEU caused by parameter optimisation is quite low (about
<inline-formula id="IEq50">
<alternatives>
<tex-math id="M91">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\pm 0.25$$\end{document}</tex-math>
<mml:math id="M92">
<mml:mrow>
<mml:mo>±</mml:mo>
<mml:mn>0.25</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq50.gif"></inline-graphic>
</alternatives>
</inline-formula>
and in almost all cases not statistically significant); given the high number of experiments, we therefore tune the parameters only once for most systems. In Sect. 
<xref rid="Sec18" ref-type="sec">5.2</xref>
, we analyse the weights assigned by MERT to each parameter in our various experimental set-ups. For decoding, test sentences are also tokenised and lowercased. The evaluation measures are applied to tokenised and lowercased outputs and reference translations. To test statistical significance, we use paired bootstrap resampling for BLEU (Koehn
<xref ref-type="bibr" rid="CR35">2004</xref>
) with
<inline-formula id="IEq51">
<alternatives>
<tex-math id="M93">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$p<0.05$$\end{document}</tex-math>
<mml:math id="M94">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo><</mml:mo>
<mml:mn>0.05</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq51.gif"></inline-graphic>
</alternatives>
</inline-formula>
and 10,000 samples. In the tables presenting translation results in the following sections, the best scores for each translation direction and domain, and those statistically indistinguishable from the best, are set in bold.</p>
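<p>A simplified sketch of the paired bootstrap resampling test (Koehn 2004) follows; corpus_bleu stands in for any function computing corpus-level BLEU from lists of hypotheses and references.</p>
<preformat>
import random

def paired_bootstrap(hyp_a, hyp_b, refs, corpus_bleu, n_samples=10000):
    """Estimate how often system A beats system B on test sets
    resampled with replacement (Koehn 2004). A is considered
    significantly better at p &lt; 0.05 if the returned value > 0.95."""
    size = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [random.randrange(size) for _ in range(size)]
        bleu_a = corpus_bleu([hyp_a[i] for i in idx], [refs[i] for i in idx])
        bleu_b = corpus_bleu([hyp_b[i] for i in idx], [refs[i] for i in idx])
        if bleu_a > bleu_b:
            wins_a += 1
    return wins_a / n_samples
</preformat>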
</sec>
<sec id="Sec14">
<title>General-domain data</title>
<p>For the baseline general-domain system, we exploited the widely used data provided by the organisers of the SMT workshops (WPT 2005
<xref ref-type="fn" rid="Fn16">16</xref>
– WMT 2011
<xref ref-type="fn" rid="Fn17">17</xref>
): the Europarl parallel corpus (Koehn
<xref ref-type="bibr" rid="CR36">2005</xref>
) as training data for translation and language models, and the WPT 2005 test sets as the development and test data for general-domain tuning and testing, respectively.
<table-wrap id="Tab8">
<label>Table 8</label>
<caption>
<p>Statistics of the general-domain data sets obtained from the Europarl corpus and the WPT workshop</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Language pair (L1–L2)</th>
<th align="left">Domain</th>
<th align="left">Set</th>
<th align="left">Source</th>
<th align="left">Sentence pairs</th>
<th align="left">L1 tokens</th>
<th align="left">L1 vocabulary</th>
<th align="left">L2 tokens</th>
<th align="left">L2 vocabulary</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="3">English–French</td>
<td align="left" rowspan="3">
<italic>gen</italic>
</td>
<td align="left">Train</td>
<td align="left">Europarl 5</td>
<td char="." align="char">1,725,096</td>
<td char="." align="char">47,956,886</td>
<td char="." align="char">73,645</td>
<td char="." align="char">53,262,628</td>
<td char="." align="char">103,436</td>
</tr>
<tr>
<td align="left">Dev</td>
<td align="left">WPT 2005</td>
<td char="." align="char">2,000</td>
<td char="." align="char">58,655</td>
<td char="." align="char">5,734</td>
<td char="." align="char">67,295</td>
<td char="." align="char">6,913</td>
</tr>
<tr>
<td align="left">Test</td>
<td align="left">WPT 2005</td>
<td char="." align="char">2,000</td>
<td char="." align="char">57,951</td>
<td char="." align="char">5,649</td>
<td char="." align="char">66,200</td>
<td char="." align="char">6,876</td>
</tr>
<tr>
<td align="left" rowspan="3">English–Greek</td>
<td align="left" rowspan="3">
<italic>gen</italic>
</td>
<td align="left">Train</td>
<td align="left">Europarl 5</td>
<td char="." align="char">964,242</td>
<td char="." align="char">27,446,726</td>
<td char="." align="char">61,497</td>
<td char="." align="char">27,537,853</td>
<td char="." align="char">173,435</td>
</tr>
<tr>
<td align="left">Dev</td>
<td align="left">WPT 2005</td>
<td char="." align="char">2,000</td>
<td char="." align="char">58,655</td>
<td char="." align="char">5,734</td>
<td char="." align="char">63,349</td>
<td char="." align="char">9,191</td>
</tr>
<tr>
<td align="left">Test</td>
<td align="left">WPT 2005</td>
<td char="." align="char">2,000</td>
<td char="." align="char">57,951</td>
<td char="." align="char">5,649</td>
<td char="." align="char">62,332</td>
<td char="." align="char">9,037</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>Europarl is extracted from the proceedings of the European Parliament which covers a number of topics (Koehn
<xref ref-type="bibr" rid="CR36">2005</xref>
), including some related to the domains of our interest. For this reason, we take this corpus as a base for our domain-adaptation experiments and consider it to be general-domain. There is also a practical motivation for doing this: this corpus is relatively large, available for many language pairs, easily accessible for both industry and academia, and can be expected to play the same role in real-world applications. Europarl version 5, released in 2010, comprises texts in 11 European languages including all languages of interest in this work (see Table 
<xref rid="Tab8" ref-type="table">8</xref>
). Note that the amount of parallel data for EN–EL is only about half of what is available for EN–FR. Furthermore, Greek morphology is more complex than French morphology, so the Greek vocabulary size (we count unique lowercased alphabetical tokens) is much larger than the French one. The WPT 2005 development and test sets contain 2,000 sentence pairs each, available in the same languages as Europarl; they were provided by the WPT 2005 organisers as development and test sets for the translation shared task (later WMT test sets do not include Greek data). All data sets used in our experiments contain a single reference translation.</p>
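<p>The vocabulary sizes in Table 8 follow the definition above (unique lowercased alphabetical tokens); a minimal sketch of the count:</p>
<preformat>
def vocabulary_size(tokenised_lines):
    """Count unique lowercased alphabetical tokens, the definition
    of vocabulary size used in Table 8."""
    vocab = set()
    for line in tokenised_lines:
        vocab.update(t for t in line.lower().split() if t.isalpha())
    return len(vocab)
</preformat>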
</sec>
<sec id="Sec15">
<title>Baseline system evaluation</title>
<p>A number of previously published experiments (e.g., Wu et al.
<xref ref-type="bibr" rid="CR69">2008</xref>
; Banerjee et al.
<xref ref-type="bibr" rid="CR3">2010</xref>
) reported significant degradation in translation quality when an SMT system was applied to out-of-domain data. In order to verify this observation, we compare the performance of the baseline system (trained and tuned on general-domain data) on all our test sets: general-domain (
<italic>gen</italic>
) and domain-specific (
<italic>env</italic>
,
<italic>lab</italic>
). We present the results in Table 
<xref rid="Tab9" ref-type="table">9</xref>
.
<table-wrap id="Tab9">
<label>Table 9</label>
<caption>
<p>Performance comparison of the baseline systems (
<italic>B0</italic>
) tested on general (
<italic>gen</italic>
) and specific (
<italic>env</italic>
,
<italic>lab</italic>
) domains</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Direction</th>
<th align="left" colspan="3">General</th>
<th align="left" colspan="4">Environment</th>
<th align="left" colspan="4">Labour legislation</th>
</tr>
<tr>
<th align="left">BLEU</th>
<th align="left">OOV</th>
<th align="left">PPL</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq53">
<alternatives>
<tex-math id="M95">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M96">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq53.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">OOV</th>
<th align="left">PPL</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq54">
<alternatives>
<tex-math id="M97">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M98">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq54.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">OOV</th>
<th align="left">PPL</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">English–French</td>
<td char="." align="char">52.57</td>
<td char="." align="char">0.11</td>
<td char="." align="char">28.1</td>
<td char="." align="char">29.61</td>
<td align="left">
<inline-formula id="IEq55">
<alternatives>
<tex-math id="M99">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M100">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq55.gif"></inline-graphic>
</alternatives>
</inline-formula>
22.96</td>
<td char="." align="char">0.98</td>
<td char="." align="char">67.8</td>
<td char="." align="char">23.94</td>
<td align="left">
<inline-formula id="IEq56">
<alternatives>
<tex-math id="M101">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M102">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq56.gif"></inline-graphic>
</alternatives>
</inline-formula>
28.63</td>
<td char="." align="char">0.85</td>
<td char="." align="char">83.2</td>
</tr>
<tr>
<td align="left">French–English</td>
<td char="." align="char">57.16</td>
<td char="." align="char">0.11</td>
<td char="." align="char">32.0</td>
<td char="." align="char">31.79</td>
<td align="left">
<inline-formula id="IEq57">
<alternatives>
<tex-math id="M103">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M104">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq57.gif"></inline-graphic>
</alternatives>
</inline-formula>
25.37</td>
<td char="." align="char">0.81</td>
<td char="." align="char">122.0</td>
<td char="." align="char">26.96</td>
<td align="left">
<inline-formula id="IEq58">
<alternatives>
<tex-math id="M105">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M106">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq58.gif"></inline-graphic>
</alternatives>
</inline-formula>
30.20</td>
<td char="." align="char">0.68</td>
<td char="." align="char">153.6</td>
</tr>
<tr>
<td align="left">English–Greek</td>
<td char="." align="char">42.52</td>
<td char="." align="char">0.22</td>
<td char="." align="char">130.0</td>
<td char="." align="char">21.20</td>
<td align="left">
<inline-formula id="IEq59">
<alternatives>
<tex-math id="M107">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M108">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq59.gif"></inline-graphic>
</alternatives>
</inline-formula>
21.32</td>
<td char="." align="char">1.15</td>
<td char="." align="char">119.8</td>
<td char="." align="char">24.04</td>
<td align="left">
<inline-formula id="IEq60">
<alternatives>
<tex-math id="M109">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M110">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq60.gif"></inline-graphic>
</alternatives>
</inline-formula>
18.48</td>
<td char="." align="char">0.47</td>
<td char="." align="char">82.1</td>
</tr>
<tr>
<td align="left">Greek–English</td>
<td char="." align="char">44.30</td>
<td char="." align="char">0.56</td>
<td char="." align="char">36.0</td>
<td char="." align="char">29.31</td>
<td align="left">
<inline-formula id="IEq61">
<alternatives>
<tex-math id="M111">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M112">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq61.gif"></inline-graphic>
</alternatives>
</inline-formula>
14.99</td>
<td char="." align="char">1.53</td>
<td char="." align="char">115.4</td>
<td char="." align="char">31.73</td>
<td align="left">
<inline-formula id="IEq62">
<alternatives>
<tex-math id="M113">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M114">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq62.gif"></inline-graphic>
</alternatives>
</inline-formula>
12.57</td>
<td char="." align="char">0.69</td>
<td char="." align="char">74.9</td>
</tr>
<tr>
<td align="left">Average</td>
<td char="." align="char"></td>
<td char="." align="char">0.25</td>
<td char="." align="char">56.6</td>
<td char="." align="char"></td>
<td align="left">
<inline-formula id="IEq63">
<alternatives>
<tex-math id="M115">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M116">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq63.gif"></inline-graphic>
</alternatives>
</inline-formula>
21.16</td>
<td char="." align="char">1.12</td>
<td char="." align="char">106.4</td>
<td char="." align="char"></td>
<td align="left">
<inline-formula id="IEq64">
<alternatives>
<tex-math id="M117">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-$$\end{document}</tex-math>
<mml:math id="M118">
<mml:mo>-</mml:mo>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq64.gif"></inline-graphic>
</alternatives>
</inline-formula>
22.47</td>
<td char="." align="char">0.67</td>
<td char="." align="char">98.5</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq52">
<alternatives>
<tex-math id="M119">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M120">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq52.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to the change in BLEU score over the
<italic>gen</italic>
domain, OOV to the out-of-vocabulary rate (%) of the test sentences, and PPL to perplexity of the reference translations given the target-side language models</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>The BLEU scores obtained on the general-domain test sets are quite high, ranging from 42.52 to 57.16 points. This is because the development and test sentence pairs were taken from the same source (proceedings of the European Parliament), where similar expressions and phrases often recur. We found that about 5 % of the EN–FR development and test sentence pairs also occur in the training data (although no sentence-pair duplicates were found in the EN–EL test sets). The duplicates were probably added to later versions of Europarl after the WPT 2005 test sets were released, but this does not affect the domain-adaptation experiments presented in this paper.</p>
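<p>A check of this kind amounts to set membership over exact sentence pairs; a minimal sketch (not the exact script we used):</p>
<preformat>
def overlap_rate(train_pairs, test_pairs):
    """Fraction of test sentence pairs occurring verbatim in the
    training data; pairs are (source, target) tuples."""
    train_set = set(train_pairs)
    duplicates = sum(1 for pair in test_pairs if pair in train_set)
    return duplicates / len(test_pairs)
</preformat>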
<p>Switching from general-domain to domain-specific test sets yields an average absolute decrease of 21.16 BLEU points (48.22 % relative) on the
<italic>env</italic>
domain and 22.47 BLEU points (44.84 % relative) on the
<italic>lab</italic>
domain (see columns denoted by
<inline-formula id="IEq65">
<alternatives>
<tex-math id="M121">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M122">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq65.gif"></inline-graphic>
</alternatives>
</inline-formula>
in Table 
<xref rid="Tab9" ref-type="table">9</xref>
). Although the magnitude of the decrease might be a little overestimated (due to the occurrence of a portion of the
<italic>gen</italic>
test data in the training data), the drop in translation quality is evident. It is caused by the divergence of the training and test data, which is also illustrated by the increase in the OOV rates (ratios of untranslated words) and in the perplexity (PPL) of the reference translations of the test sets given language models trained on the target side of the parallel training data (this indicates how well the language model captures the characteristics of the target-language text). For both measures, lower scores indicate a better fit.</p>
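<p>The OOV rate used here is the proportion of source-side test tokens never seen in the training data; a minimal sketch:</p>
<preformat>
def oov_rate(train_lines, test_lines):
    """Percentage of test tokens absent from the training
    vocabulary (the OOV rate reported in Table 9)."""
    train_vocab = {tok for line in train_lines for tok in line.split()}
    test_tokens = [tok for line in test_lines for tok in line.split()]
    oov = sum(1 for tok in test_tokens if tok not in train_vocab)
    return 100.0 * oov / len(test_tokens)
</preformat>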
<p>The OOV rate increases from an average of 0.25 % on the
<italic>gen</italic>
domain to 1.12 % on the
<italic>env</italic>
domain and 0.67 % on the
<italic>lab</italic>
domain, and the average perplexity increases from 56.6 on the
<italic>gen</italic>
domain to 106.4 on the
<italic>env</italic>
domain and 98.5 on the
<italic>lab</italic>
domain (see Table 
<xref rid="Tab9" ref-type="table">9</xref>
). The perplexity almost doubles when going from the general (
<italic>gen</italic>
) to specific (
<italic>env</italic>
,
<italic>lab</italic>
) domains, which makes scoring of hypotheses during decoding difficult. An interesting case is the EN–EL translation direction, where the highest perplexity is surprisingly achieved on the
<italic>gen</italic>
domain. This is probably due to the morphological complexity of the target language and the nature of the particular test set. After a thorough analysis of the Greek side of this test set, we discovered some inconsistency in tokenisation (introduced by the providers of the data), which contributed to the higher PPL value. This does not, however, influence the findings in this work. In all other cases, perplexity increases for domain-specific data.</p>
</sec>
</sec>
<sec id="Sec16">
<title>Domain adaptation by parameter tuning</title>
<p>Optimisation of the parameters of the log-linear combination, on which most modern SMT systems are based, is known to have a strong influence on translation performance. A sensible first step towards domain adaptation of a general-domain system is to use in-domain development data. Such data usually comprises a small set of parallel sentences that are repeatedly translated until the model parameters reach their optimal values.
<xref ref-type="fn" rid="Fn18">18</xref>
<table-wrap id="Tab10">
<label>Table 10</label>
<caption>
<p>Parameter tuning of the baseline (general-domain-trained) systems on various development data: general-domain (
<italic>B0</italic>
), corrected in-domain sentences (
<italic>P1</italic>
), raw in-domain sentences (
<italic>P2</italic>
), cross-domain data (
<italic>P3</italic>
), and by using the default weights (
<italic>P4</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Direction</th>
<th align="left" rowspan="2">Test</th>
<th align="left">General (
<italic>B0</italic>
)</th>
<th align="left" colspan="2">In-domain (
<italic>P1</italic>
)</th>
<th align="left" colspan="2">In-domain
<sup>raw</sup>
(
<italic>P2</italic>
)</th>
<th align="left" colspan="2">Cross-domain (
<italic>P3</italic>
)</th>
<th align="left" colspan="2">Default (
<italic>P4</italic>
)</th>
</tr>
<tr>
<th align="left">BLEU</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq68">
<alternatives>
<tex-math id="M123">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M124">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq68.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq69">
<alternatives>
<tex-math id="M125">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M126">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq69.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq70">
<alternatives>
<tex-math id="M127">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M128">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq70.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq71">
<alternatives>
<tex-math id="M129">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M130">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq71.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="2">English–French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">29.61</td>
<td char="." align="char">
<bold>37.51</bold>
</td>
<td char="." align="char">
<bold>7.90</bold>
</td>
<td char="." align="char">37.25</td>
<td char="." align="char">7.64</td>
<td char="." align="char">
<bold>37.47</bold>
</td>
<td char="." align="char">
<bold>7.86</bold>
</td>
<td char="." align="char">36.24</td>
<td char="." align="char">6.63</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">23.94</td>
<td char="." align="char">
<bold>32.15</bold>
</td>
<td char="." align="char">
<bold>8.21</bold>
</td>
<td char="." align="char">31.88</td>
<td char="." align="char">7.94</td>
<td char="." align="char">31.82</td>
<td char="." align="char">7.88</td>
<td char="." align="char">30.60</td>
<td char="." align="char">6.66</td>
</tr>
<tr>
<td align="left" rowspan="2">French–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">31.79</td>
<td char="." align="char">
<bold>39.05</bold>
</td>
<td char="." align="char">
<bold>7.26</bold>
</td>
<td char="." align="char">
<bold>38.93</bold>
</td>
<td char="." align="char">
<bold>7.14</bold>
</td>
<td char="." align="char">
<bold>38.79</bold>
</td>
<td char="." align="char">
<bold>7.00</bold>
</td>
<td char="." align="char">34.05</td>
<td char="." align="char">2.26</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">26.96</td>
<td char="." align="char">
<bold>33.48</bold>
</td>
<td char="." align="char">
<bold>6.52</bold>
</td>
<td char="." align="char">
<bold>33.34</bold>
</td>
<td char="." align="char">
<bold>6.38</bold>
</td>
<td char="." align="char">33.07</td>
<td char="." align="char">6.11</td>
<td char="." align="char">29.69</td>
<td char="." align="char">2.73</td>
</tr>
<tr>
<td align="left" rowspan="2">English–Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">21.20</td>
<td char="." align="char">
<bold>27.56</bold>
</td>
<td char="." align="char">
<bold>6.36</bold>
</td>
<td char="." align="char">27.29</td>
<td char="." align="char">6.09</td>
<td char="." align="char">27.26</td>
<td char="." align="char">6.06</td>
<td char="." align="char">27.16</td>
<td char="." align="char">5.96</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">24.04</td>
<td char="." align="char">
<bold>30.07</bold>
</td>
<td char="." align="char">
<bold>6.03</bold>
</td>
<td char="." align="char">
<bold>30.23</bold>
</td>
<td char="." align="char">
<bold>6.19</bold>
</td>
<td char="." align="char">29.68</td>
<td char="." align="char">5.64</td>
<td char="." align="char">29.76</td>
<td char="." align="char">5.72</td>
</tr>
<tr>
<td align="left" rowspan="2">Greek–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">29.31</td>
<td char="." align="char">
<bold>34.31</bold>
</td>
<td char="." align="char">
<bold>5.00</bold>
</td>
<td char="." align="char">
<bold>34.32</bold>
</td>
<td char="." align="char">
<bold>5.01</bold>
</td>
<td char="." align="char">33.98</td>
<td char="." align="char">4.67</td>
<td char="." align="char">31.45</td>
<td char="." align="char">2.14</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">31.73</td>
<td char="." align="char">
<bold>37.57</bold>
</td>
<td char="." align="char">
<bold>5.84</bold>
</td>
<td char="." align="char">
<bold>37.68</bold>
</td>
<td char="." align="char">
<bold>5.95</bold>
</td>
<td char="." align="char">
<bold>37.58</bold>
</td>
<td char="." align="char">
<bold>5.85</bold>
</td>
<td char="." align="char">34.95</td>
<td char="." align="char">3.22</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">6.64</td>
<td char="." align="char"></td>
<td char="." align="char">6.54</td>
<td char="." align="char"></td>
<td char="." align="char">6.38</td>
<td char="." align="char"></td>
<td char="." align="char">4.42</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq66">
<alternatives>
<tex-math id="M131">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M132">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq66.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement in BLEU over the baseline (
<italic>B0</italic>
)</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>By using the parallel data acquisition procedure described in Sect. 
<xref rid="Sec7" ref-type="sec">3</xref>
, we acquired development sets (506–1,411 sentence pairs depending on the language pair), which prove to be very beneficial for parameter tuning in our experiments (see Table 
<xref rid="Tab10" ref-type="table">10</xref>
). Compared to the baseline systems trained and tuned on general-domain data only (denoted as
<italic>B0)</italic>
, the systems trained on general-domain data and tuned on in-domain data (denoted as
<italic>P1</italic>
) improve BLEU by 6.64 points absolute (24.82 % relative) on average (compare columns
<italic>B0</italic>
and
<italic>P1</italic>
in Table 
<xref rid="Tab10" ref-type="table">10</xref>
). This behaviour is to be expected, but taking into account that the development sets contain only several hundred parallel sentences each, such an improvement is nevertheless remarkable.</p>
<sec id="Sec17">
<title>Correction of development data</title>
<p>A small amount of manual effort was invested in correcting the test and development data acquired for the specific domains (see Sect. 
<xref rid="Sec11" ref-type="sec">3.4</xref>
). In order to assess the practical need to correct the development data, we compare baseline systems tuned on manually corrected development sets with systems tuned on raw development sets. This raw development data (denoted by
<inline-formula id="IEq72">
<alternatives>
<tex-math id="M133">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${raw}$$\end{document}</tex-math>
<mml:math id="M134">
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq72.gif"></inline-graphic>
</alternatives>
</inline-formula>
in Table 
<xref rid="Tab7" ref-type="table">7</xref>
) contains not only sentence pairs with imperfect translations, but also pairs that are misaligned and/or belong to other domains. As a consequence, the raw development sets contain 5–14 % more sentence pairs than the corrected ones (see Table 
<xref rid="Tab7" ref-type="table">7</xref>
). The performance of the systems tuned using the raw development data is shown in Table 
<xref rid="Tab10" ref-type="table">10</xref>
, column
<italic>P2</italic>
. In general, the absolute differences in BLEU compared to the
<italic>P1</italic>
systems are very small and not statistically significant for most of the scenarios (figures in bold). The average absolute improvement over the baseline system
<italic>B0</italic>
is 6.54 BLEU points, which is only 0.1 points less than the score obtained by the
<italic>P1</italic>
systems. In practice, this finding makes the manual correction of development data acquired by our procedure unnecessary, since the results obtained using raw parallel data are comparable.</p>
</sec>
<sec id="Sec18">
<title>Analysis of model parameters</title>
<p>The only things that change when the systems are tuned on in-domain data are the weights of the feature functions in the log-linear combination optimised by MERT. The reordering, language, and translation models all remain untouched, as they are trained on general-domain data. Recall that the parameter space searched by MERT is large and the error surface highly non-convex, so the resulting weight vectors might not be globally optimal, and other weight vectors might perform equally well or even better. For this reason, the actual parameter values are not usually investigated. Our experiments, however, show that the parameter values and the changes observed when switching from general-domain to domain-specific tuning are in fact highly consistent, indicating interesting trends (compare the black and grey bars in Fig.
<xref rid="Fig3" ref-type="fig">3</xref>
).
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>Visualisation of model weights of the four systems in the twelve evaluation scenarios; the
<italic>black bars</italic>
refer to model weights of the systems tuned on general-domain (
<italic>gen</italic>
) development sets, while the
<italic>grey bars</italic>
refer to the model weights of the systems tuned on domain-specific development sets (
<italic>env</italic>
,
<italic>lab</italic>
)</p>
</caption>
<graphic xlink:href="10579_2014_9282_Fig3_HTML" id="MO6"></graphic>
</fig>
</p>
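<p>To make explicit what MERT adjusts: a hypothesis score in the log-linear model is the weighted sum of the feature function values, and decoding prefers the highest-scoring hypothesis. A schematic sketch follows (the feature values themselves come from the models and are not recomputed here).</p>
<preformat>
def hypothesis_score(weights, features):
    """Log-linear model score: dot product of the MERT-tuned weight
    vector and the feature function values h_1 ... h_14."""
    assert len(weights) == len(features)
    return sum(w * h for w, h in zip(weights, features))

def best_hypothesis(weights, hypotheses):
    """Pick the highest-scoring hypothesis; each hypothesis is
    represented here by its feature vector."""
    return max(hypotheses, key=lambda feats: hypothesis_score(weights, feats))
</preformat>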
<p>The high weights assigned to
<inline-formula id="IEq73">
<alternatives>
<tex-math id="M135">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{11}$$\end{document}</tex-math>
<mml:math id="M136">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>11</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq73.gif"></inline-graphic>
</alternatives>
</inline-formula>
(
<italic>direct phrase translation probability</italic>
) of the general-domain tuned systems (black bars) indicate that the phrase pairs in the systems’ translation tables apply well to the development data that are from the same domain as the training data; a high reward is given to translation hypotheses consisting of phrases with high translation probability (i.e., good general-domain translations). The low negative weights assigned to
<inline-formula id="IEq74">
<alternatives>
<tex-math id="M137">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{13}$$\end{document}</tex-math>
<mml:math id="M138">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>13</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq74.gif"></inline-graphic>
</alternatives>
</inline-formula>
(
<italic>phrase penalty</italic>
) imply that the systems prefer hypotheses consisting of fewer but longer phrases. Reordering in the hypotheses is not rewarded and therefore not explicitly preferred (the weights of the reordering models
<inline-formula id="IEq75">
<alternatives>
<tex-math id="M139">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{1}$$\end{document}</tex-math>
<mml:math id="M140">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq75.gif"></inline-graphic>
</alternatives>
</inline-formula>
<inline-formula id="IEq76">
<alternatives>
<tex-math id="M141">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{7}$$\end{document}</tex-math>
<mml:math id="M142">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>7</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq76.gif"></inline-graphic>
</alternatives>
</inline-formula>
are assigned values around zero). In some scenarios (e.g., for EN–FR and FR–EN), certain reordering schemes are even slightly penalised (several weights of
<inline-formula id="IEq77">
<alternatives>
<tex-math id="M143">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{1}$$\end{document}</tex-math>
<mml:math id="M144">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq77.gif"></inline-graphic>
</alternatives>
</inline-formula>
<inline-formula id="IEq78">
<alternatives>
<tex-math id="M145">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{7}$$\end{document}</tex-math>
<mml:math id="M146">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>7</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq78.gif"></inline-graphic>
</alternatives>
</inline-formula>
have negative values). The weight of
<inline-formula id="IEq79">
<alternatives>
<tex-math id="M147">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{14}$$\end{document}</tex-math>
<mml:math id="M148">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>14</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq79.gif"></inline-graphic>
</alternatives>
</inline-formula>
(
<italic>word penalty</italic>
) is negative for the systems translating from English and slightly positive for systems translating into English. This reflects the fact that translation from English prefers shorter hypotheses (fewer words), while translation into English prefers longer hypotheses (consisting of more words). This is probably due to the relative morphological complexities of English and the other languages.</p>
<p>Comparing these findings with the results of the systems tuned on the specific domains (grey bars), we observe that the weights of
<inline-formula id="IEq80">
<alternatives>
<tex-math id="M149">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{11}$$\end{document}</tex-math>
<mml:math id="M150">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>11</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq80.gif"></inline-graphic>
</alternatives>
</inline-formula>
(
<italic>direct phrase translation probability</italic>
) decrease sharply, with this weight being close to zero in some scenarios. The translation tables do not provide enough good-quality translations for the specific domains, and the best translations of the development sentences consist of phrases with varying translation probabilities. Hypotheses consisting of few (and long) phrases are no longer rewarded (the weights of
<inline-formula id="IEq81">
<alternatives>
<tex-math id="M151">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{13}$$\end{document}</tex-math>
<mml:math id="M152">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>13</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq81.gif"></inline-graphic>
</alternatives>
</inline-formula>
are higher); in most cases they are penalised, and hypotheses consisting of more (and shorter) phrases are allowed or even preferred. In almost all cases, the reordering feature weights (features
<inline-formula id="IEq82">
<alternatives>
<tex-math id="M153">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_1$$\end{document}</tex-math>
<mml:math id="M154">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq82.gif"></inline-graphic>
</alternatives>
</inline-formula>
<inline-formula id="IEq83">
<alternatives>
<tex-math id="M155">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_7$$\end{document}</tex-math>
<mml:math id="M156">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>7</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq83.gif"></inline-graphic>
</alternatives>
</inline-formula>
) increase substantially, and for domain-specific data the model strongly prefers hypotheses with specific reordering (which is consistent with the two preceding observations). Language model weights (
<inline-formula id="IEq84">
<alternatives>
<tex-math id="M157">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_8$$\end{document}</tex-math>
<mml:math id="M158">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>8</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq84.gif"></inline-graphic>
</alternatives>
</inline-formula>
) do not change substantially; the importance of the language model as a feature remains similar on general-domain and domain-specific data.</p>
<p>As can be seen in Fig. 
<xref rid="Fig3" ref-type="fig">3</xref>
, these findings are consistent across domains and language pairs. The weight vectors of the systems tuned on domain-specific data are quite similar but differ substantially from the parameters obtained by tuning on general domain data.</p>
</sec>
<sec id="Sec19">
<title>Analysis of phrase-length distribution</title>
<p>From the analysis presented above, we conclude that a PB-SMT system tuned on data from the same domain as the training data strongly prefers to construct translations from long phrases. Such phrases are usually of good translation quality (local word-alignment mistakes disappear), fluent (formed by consecutive sequences of words), and recurrent (frequent in data from the same domain). Accordingly, they form good translations of the input sentences and are preferred during decoding. This is, of course, a positive trait when the system translates sentences from the same domain. However, if the input sentences match very few longer phrases from the translation tables, the general-domain-tuned system cannot construct good translations, because it still prefers longer phrases that are inadequate for this domain. In this case, shorter phrases could allow better translations to be stitched together.</p>
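<p>The phrase segmentation chosen by the decoder can be recovered from its trace output; the sketch below assumes Moses-style span markers such as |0-2| after each output phrase (an assumption about the trace format, to be adapted to the decoder at hand).</p>
<preformat>
import re

SPAN = re.compile(r'\|(\d+)-(\d+)\|')

def average_phrase_length(traced_lines):
    """Average length (in source words) of the phrases used by the
    decoder, read off span markers like |0-2| in traced output."""
    lengths = [int(end) - int(start) + 1
               for line in traced_lines
               for start, end in SPAN.findall(line)]
    return sum(lengths) / len(lengths)
</preformat>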
<p>To support this hypothesis, we analysed the phrase length distribution actually seen in the translation of the test sets. The average phrase lengths estimated for various combinations of tuning and test domains and all language pairs are shown in Table 
<xref rid="Tab11" ref-type="table">11</xref>
. The highest values are observed for translations of general-domain test sets by systems tuned on the same domain: 3.49 on average across all language pairs. The values for systems trained on general-domain data but tuned and tested on domain-specific data are considerably lower and range from 1.54 to 2.97, depending on the domain and language pair. Figure 
<xref rid="Fig4" ref-type="fig">4</xref>
illustrates the complete phrase-length distribution in EN–FR translations by systems tuned and tested on various combinations of general and specific domains.
<table-wrap id="Tab11">
<label>Table 11</label>
<caption>
<p>Average phrase length in translations by systems tuned/tested on various combinations of domains</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left">
<italic>gen</italic>
/
<italic>gen</italic>
</th>
<th align="left">
<italic>gen</italic>
/
<italic>env</italic>
</th>
<th align="left">
<italic>env</italic>
/
<italic>env</italic>
</th>
<th align="left">
<italic>gen</italic>
/
<italic>lab</italic>
</th>
<th align="left">
<italic>lab</italic>
/
<italic>lab</italic>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">English–French</td>
<td char="." align="char">4.37</td>
<td char="." align="char">3.00</td>
<td char="." align="char">2.16</td>
<td char="." align="char">2.82</td>
<td char="." align="char">2.05</td>
</tr>
<tr>
<td align="left">French–English</td>
<td char="." align="char">3.46</td>
<td char="." align="char">2.49</td>
<td char="." align="char">1.77</td>
<td char="." align="char">2.45</td>
<td char="." align="char">1.83</td>
</tr>
<tr>
<td align="left">English–Greek</td>
<td char="." align="char">3.76</td>
<td char="." align="char">2.69</td>
<td char="." align="char">2.17</td>
<td char="." align="char">2.97</td>
<td char="." align="char">2.46</td>
</tr>
<tr>
<td align="left">Greek–English</td>
<td char="." align="char">2.35</td>
<td char="." align="char">2.18</td>
<td char="." align="char">1.54</td>
<td char="." align="char">2.43</td>
<td char="." align="char">2.30</td>
</tr>
<tr>
<td align="left">Average</td>
<td char="." align="char">3.49</td>
<td char="." align="char">2.59</td>
<td char="." align="char">1.91</td>
<td char="." align="char">2.67</td>
<td char="." align="char">2.16</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Distribution of phrase length in English–French translations by systems tuned/tested on various combinations of general (
<italic>gen</italic>
) and specific (
<italic>env</italic>
,
<italic>lab</italic>
) domains (maximum phrase length set to seven)</p>
</caption>
<graphic xlink:href="10579_2014_9282_Fig4_HTML" id="MO7"></graphic>
</fig>
</p>
<p>Generally, a higher divergence of the test domain from the training domain leads to shorter phrases being used in translation. However, when the systems tuned on general-domain data are applied to specific domains, the average phrase lengths are consistently longer than with domain-specific tuning. The systems are tuned to prefer long phrases, but the translation quality is lower. This situation can be interpreted as overfitting: the model overfits the training (and tuning) data, and on a different domain it fails to form the best possible translations (given the translation, reordering, and language models). Nevertheless, preferring translations constructed from shorter phrases (even single words) is not always better. For example, word-by-word translation of non-compositional phrases would generally be erroneous.</p>
</sec>
<sec id="Sec20">
<title>Other alternatives to parameter optimisation</title>
<p>As we have already shown, in-domain tuning is an effective way to reduce such overfitting. The problem, however, can also be reduced by cross-tuning, i.e., tuning on a specific domain different from the test domain (tuning on
<italic>lab</italic>
and testing on
<italic>env</italic>
, and vice versa); see Table 
<xref rid="Tab10" ref-type="table">10</xref>
, column
<italic>P3</italic>
. In three scenarios (bold figures), such systems perform as well as the in-domain tuned ones (no statistically significant difference). In the other scenarios, the absolute difference in BLEU is less than 0.4 points. The average gain over the systems tuned on the general domain (
<italic>B0</italic>
) is 6.38 points absolute (compared with 6.64 points obtained by
<italic>P1</italic>
). This observation is not very intuitive: one would expect each domain to require specific tuning. However, it seems that in-domain tuning does not optimise the general-domain-trained system for a particular specific domain, but rather for
<italic>any</italic>
domain diverging from the general domain in a similar way (e.g., to the extent that the translation model and language model cover the test data).</p>
<p>For comparison purposes only, we also report the results of the non-tuned systems
<italic>P4</italic>
using the default weight vector set by Moses (
<inline-formula id="IEq85">
<alternatives>
<tex-math id="M159">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{1,\dots ,7}=0.3, h_8=0.5, h_{9,\dots ,13}=0.2, h_{14}=-1$$\end{document}</tex-math>
<mml:math id="M160">
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mn>7</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0.3</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>8</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0.5</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mn>9</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo></mml:mo>
<mml:mo>,</mml:mo>
<mml:mn>13</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>0.2</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>14</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq85.gif"></inline-graphic>
</alternatives>
</inline-formula>
). Even this approach outperforms the baseline systems
<italic>B0</italic>
. In some cases (e.g., the EN–EL translations), the results are very close to those of systems tuned on in-domain data (
<italic>P1</italic>
). The average absolute improvement of the systems with default parameters (
<italic>P4</italic>
) over the systems tuned on the general domain is 4.42 BLEU points (compared with the average of 6.64 points obtained from domain-specific tuning).</p>
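<p>For orientation, the default vector can be written out as follows; the grouping comments are ours, and the feature indices follow Sect. 2.3.</p>
<preformat>
# Default Moses weight vector as described above (h_1 ... h_14).
default_weights = (
    [0.3] * 7     # h_1 ... h_7: distortion and lexicalised reordering
    + [0.5]       # h_8: language model
    + [0.2] * 5   # h_9 ... h_13: translation model features
    + [-1.0]      # h_14: word penalty
)
</preformat>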
</sec>
<sec id="Sec21">
<title>Analysis of learning curves</title>
<p>Domain-specific parallel data is often scarce or completely unavailable for many vertical sectors and must be prepared by manual translation of monolingual in-domain sentences. We thus investigate how much development data is actually needed. The only technical requirement is that the parameter optimisation method (MERT, here) must converge in a reasonable number of iterations. For this reason, typical development sets contain about 1,000–2,000 sentence pairs (cf. the size of the development sets provided for the WMT translation shared tasks). We vary the number of sentences in our development sets, tune the systems, test their performance on the test sets, and plot learning curves to capture the relationship between translation quality (in terms of BLEU) and gradual increases in the size of the development data.
<fig id="Fig5">
<label>Fig. 5</label>
<caption>
<p>Translation quality (BLEU) of FR–EN systems tuned on data of varying size. The domains of the development and test sets are given in this order (
<italic>dev</italic>
/
<italic>test</italic>
)</p>
</caption>
<graphic xlink:href="10579_2014_9282_Fig5_HTML" id="MO8"></graphic>
</fig>
</p>
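<p>The curves are produced by tuning on nested subsets of the development data; schematically (tune_and_evaluate is a stand-in for a full MERT run followed by decoding and scoring the test set):</p>
<preformat>
def learning_curve(dev_pairs, sizes, tune_and_evaluate):
    """BLEU as a function of development-set size: for each n,
    tune on the first n sentence pairs and record the test score."""
    return [(n, tune_and_evaluate(dev_pairs[:n])) for n in sizes]

# e.g., learning_curve(dev, [100, 200, 400, 600, 800, 1000], run_mert)
</preformat>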
<p>The general shapes of the curves are consistent across all language pairs and thus we provide the curves for the EN–FR translation direction only (see Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
). Increasing the size of the development sets is beneficial only where the development and test data come from the same domain. The curve of the system tuned and tested on the general domain reaches a plateau at about 500 sentence pairs. With in-domain tuning for the specific domains, the plateau is reached much earlier: usually, as few as 100–200 sentence pairs are enough to obtain optimal results. This is encouraging, as tuning on specific domains yields the best results, fortunately requires only very limited amounts of bilingual data, and seems reasonably tolerant to imperfect translations and noise in the development sentences. Development sets of more than 400–600 sentence pairs do not improve translation quality at all and make the tuning process take longer; at the same time, the additional tuning data does not actively degrade performance, so there is no need to reduce the size of the tuning set. The systems tuned on general-domain data and tested on the specific domains do not benefit from the development data at all; the initial, relatively high BLEU scores achieved with zero-size development sets (i.e., no tuning) decrease as the size of the general-domain development sets grows (see the curves denoted as
<italic>gen</italic>
/
<italic>env</italic>
and
<italic>gen</italic>
/
<italic>lab</italic>
in Fig. 
<xref rid="Fig5" ref-type="fig">5</xref>
).</p>
</sec>
</sec>
<sec id="Sec22">
<title>Adaptation of language and translation models</title>
<p>In this section, we explore the potential of adapting the components of the SMT model (language and translation models) by exploiting the crawled domain-specific data in addition to the general-domain data used for training the baseline systems.</p>
<sec id="Sec23">
<title>Language model adaptation</title>
<p>Improving an SMT system by adding in-domain monolingual training data can neither reduce OOV rates nor introduce new phrase pairs into the translation models. Such data can, however, improve the language models, contribute to better estimates of translation fluency, and thus help select better translation hypotheses.</p>
<p>In general, there are two ways of using monolingual data for adaptation of the SMT model: the trivial approach is to retrain the existing language model on a simple concatenation of the original general-domain data and the new domain-specific data; a more advanced approach is to build an additional language model based on the domain-specific data only and use it together with the original one. This is possible in two ways (Foster and Kuhn
<xref ref-type="bibr" rid="CR26">2007</xref>
): the two models can be merged by linear interpolation into a single model, or used directly as separate components in the log-linear combination of the system. The two approaches are similar but not identical. Both are parametrised by a single weight corresponding to the relative importance of the two models (a linear interpolation coefficient and a model weight, respectively), and both require optimisation. Linear interpolation can be optimised by minimising the perplexity of held-out target-language data (e.g., the target side of the development set); log-linear combination allows direct optimisation of MT quality (e.g., by MERT).
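As an illustration of the interpolation variant, the coefficient can be found by a simple grid search minimising held-out perplexity; the sketch below assumes the two language models have already assigned a probability to every word of the held-out text (aligned lists p_general and p_domain).
<preformat>
import math

def best_lambda(p_general, p_domain, grid=None):
    """Grid search for the interpolation coefficient that minimises
    perplexity of held-out target text under the interpolated model
    p = lam * p_general + (1 - lam) * p_domain."""
    grid = grid or [i / 100 for i in range(1, 100)]
    def perplexity(lam):
        log_sum = sum(math.log(lam * pg + (1 - lam) * pd)
                      for pg, pd in zip(p_general, p_domain))
        return math.exp(-log_sum / len(p_general))
    return min(grid, key=perplexity)
</preformat>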
<table-wrap id="Tab12">
<label>Table 12</label>
<caption>
<p>Results of language model adaptation by concatenation of training data (
<italic>L1</italic>
), linear interpolation of general-domain and domain-specific models (
<italic>L2</italic>
), and employing the two independent models in log-linear combination (
<italic>L3</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Direction</th>
<th align="left" rowspan="2">Test</th>
<th align="left">Base (
<italic>P1</italic>
)</th>
<th align="left" colspan="2">Concatenation (
<italic>L1</italic>
)</th>
<th align="left" colspan="2">Lin. interpol. (
<italic>L2</italic>
)</th>
<th align="left" colspan="2">Log-lin. comb. (
<italic>L3</italic>
)</th>
</tr>
<tr>
<th align="left">BLEU</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq87">
<alternatives>
<tex-math id="M161">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M162">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq87.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq88">
<alternatives>
<tex-math id="M163">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M164">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq88.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq89">
<alternatives>
<tex-math id="M165">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M166">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq89.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="2">English–French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">37.51</td>
<td char="." align="char">41.28</td>
<td char="." align="char">3.77</td>
<td char="." align="char">
<bold>41.78</bold>
</td>
<td char="." align="char">
<bold>4.27</bold>
</td>
<td char="." align="char">41.25</td>
<td char="." align="char">3.74</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">32.15</td>
<td char="." align="char">36.15</td>
<td char="." align="char">4.00</td>
<td char="." align="char">
<bold>38.54</bold>
</td>
<td char="." align="char">
<bold>6.39</bold>
</td>
<td char="." align="char">35.54</td>
<td char="." align="char">3.39</td>
</tr>
<tr>
<td align="left" rowspan="2">French–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">39.05</td>
<td char="." align="char">40.58</td>
<td char="." align="char">1.53</td>
<td char="." align="char">
<bold>42.63</bold>
</td>
<td char="." align="char">
<bold>3.58</bold>
</td>
<td char="." align="char">39.93</td>
<td char="." align="char">0.88</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">33.48</td>
<td char="." align="char">38.05</td>
<td char="." align="char">4.57</td>
<td char="." align="char">
<bold>41.11</bold>
</td>
<td char="." align="char">
<bold>7.63</bold>
</td>
<td char="." align="char">33.95</td>
<td char="." align="char">0.47</td>
</tr>
<tr>
<td align="left" rowspan="2">English–Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">27.56</td>
<td char="." align="char">33.59</td>
<td char="." align="char">6.03</td>
<td char="." align="char">
<bold>34.89</bold>
</td>
<td char="." align="char">
<bold>7.33</bold>
</td>
<td char="." align="char">33.65</td>
<td char="." align="char">6.09</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">30.07</td>
<td char="." align="char">
<bold>35.09</bold>
</td>
<td char="." align="char">
<bold>5.02</bold>
</td>
<td char="." align="char">34.15</td>
<td char="." align="char">4.08</td>
<td char="." align="char">34.33</td>
<td char="." align="char">4.26</td>
</tr>
<tr>
<td align="left" rowspan="2">Greek–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">34.31</td>
<td char="." align="char">37.03</td>
<td char="." align="char">2.72</td>
<td char="." align="char">
<bold>37.57</bold>
</td>
<td char="." align="char">
<bold>3.26</bold>
</td>
<td char="." align="char">36.55</td>
<td char="." align="char">2.24</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">37.57</td>
<td char="." align="char">
<bold>40.15</bold>
</td>
<td char="." align="char">
<bold>2.58</bold>
</td>
<td char="." align="char">
<bold>40.09</bold>
</td>
<td char="." align="char">
<bold>2.52</bold>
</td>
<td char="." align="char">
<bold>40.01</bold>
</td>
<td char="." align="char">
<bold>2.44</bold>
</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">3.78</td>
<td char="." align="char"></td>
<td char="." align="char">4.88</td>
<td char="." align="char"></td>
<td char="." align="char">2.94</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq86">
<alternatives>
<tex-math id="M167">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M168">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq86.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement in BLEU over
<italic>P1</italic>
trained on the general domain and tuned for specific domains</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>We experiment with all three approaches combining general-domain data (comprising 27–53 million tokens, see Table 
<xref rid="Tab8" ref-type="table">8</xref>
) and in-domain data (15–45 million tokens, see Table 
<xref rid="Tab3" ref-type="table">3</xref>
). System
<italic>L1</italic>
exploits the simple concatenation of the data,
<italic>L2</italic>
is based on linear interpolation optimised on the target side of the (in-domain) development data, and
<italic>L3</italic>
employs the two models combined in log-linear fashion, with weights tuned by MERT on BLEU. The complete results are presented in Table 
<xref rid="Tab12" ref-type="table">12</xref>
. Compared to the in-domain tuned systems (
<italic>P1</italic>
), all three methods significantly improve translation quality across all scenarios. Overall, the most effective approach is linear interpolation, with an average absolute improvement of 4.88 BLEU points (14.95 % relative). With two exceptions, systems
<italic>L2</italic>
outperform both
<italic>L1</italic>
and
<italic>L3</italic>
. In most cases, the improvement is statistically significant. For EN–EL (both directions) in the
<italic>lab</italic>
domain,
<italic>L2</italic>
is outperformed by simple concatenation (
<italic>L1</italic>
), but this can be explained by the size of the development data used to optimize the interpolation coefficient in
<italic>L2</italic>
(506 sentences), which is probably insufficient. Substantial improvements in BLEU over the system
<italic>P1</italic>
are achieved especially for translations into Greek (7.33 points for
<italic>env</italic>
, and 5.02 points for
<italic>lab</italic>
, both absolute), even though the monolingual data acquired for this language is the smallest (see Table 
<xref rid="Tab3" ref-type="table">3</xref>
). This is probably due to the complex Greek morphology and the resulting data sparsity.</p>
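<p>To make the winning linear-interpolation approach concrete, the following minimal Python sketch interpolates two toy language models and selects the interpolation coefficient by minimising perplexity on the target side of a development set. The probability tables, the development tokens, and the grid search are invented for illustration; in practice the component models are full n-gram models and the coefficient is estimated with standard tools (e.g., SRILM).</p>
<preformat>
import math

# Toy stand-ins for the general-domain and domain-specific language
# models: each maps a token to a probability, with "UNK" catching
# unseen tokens. Real systems query full n-gram models instead.
p_general = {"the": 0.05, "pollution": 0.0001, "of": 0.03, "UNK": 1e-6}
p_domain  = {"the": 0.04, "pollution": 0.004,  "of": 0.03, "UNK": 1e-6}

def prob(model, token):
    # Back off to the unknown-word probability for unseen tokens.
    return model.get(token, model["UNK"])

def perplexity(lam, tokens):
    # Perplexity of the interpolated model:
    # p(t) = lam * p_domain(t) + (1 - lam) * p_general(t)
    log_sum = sum(math.log(lam * prob(p_domain, t)
                           + (1.0 - lam) * prob(p_general, t))
                  for t in tokens)
    return math.exp(-log_sum / len(tokens))

# Target side of the in-domain development set (a toy stand-in).
dev_tokens = "the pollution of the environment".split()

# Simple grid search over the interpolation coefficient; real setups
# typically estimate it with EM instead.
best_ppl, best_lam = min((perplexity(l / 100.0, dev_tokens), l / 100.0)
                         for l in range(1, 100))
print("lambda = %.2f, dev perplexity = %.2f" % (best_lam, best_ppl))
</preformat>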
</sec>
<sec id="Sec24">
<title>Translation model adaptation</title>
<p>Parallel data is essential for building translation models of SMT systems. While a good language model can improve an SMT system by preferring better phrase translation options in given contexts, it has no effect if the translation model fails to provide a phrase translation at all. In this experiment, we analyse the effect of using the domain-specific parallel training data acquired as described in Sect. 
<xref rid="Sec9" ref-type="sec">3.2</xref>
. These data sets are relatively small, comprising 7,000–20,000 sentence pairs, depending on the language pair and domain (see Table 
<xref rid="Tab5" ref-type="table">5</xref>
).</p>
<p>Similar to language model adaptation discussed in the previous subsection, there are three main methods to combine parallel training data from two sources (Banerjee et al.
<xref ref-type="bibr" rid="CR4">2011</xref>
): first, retraining the existing translation model on a simple concatenation of the original general-domain and the new domain-specific data; second, training a new translation model on the domain-specific data and interpolating the two models in a linear fashion; and third, using the two translation models in log-linear combination. The first approach does not require optimization of any additional parameters. The second approach requires tuning of four extra coefficients (one for each of the probability distributions provided by the translation model, i.e.,
<inline-formula id="IEq90">
<alternatives>
<tex-math id="M169">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_9$$\end{document}</tex-math>
<mml:math id="M170">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>9</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq90.gif"></inline-graphic>
</alternatives>
</inline-formula>
–
<inline-formula id="IEq91">
<alternatives>
<tex-math id="M171">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{12}$$\end{document}</tex-math>
<mml:math id="M172">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>12</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq91.gif"></inline-graphic>
</alternatives>
</inline-formula>
), which is usually done by minimizing perplexity of the development data (Sennrich
<xref ref-type="bibr" rid="CR60">2012</xref>
). The third approach adds a total of five new weights (associated with the new translation model) to the weight vector, which is then optimised in the traditional way by maximising translation quality on the development data (by MERT, in our case).
<table-wrap id="Tab13">
<label>Table 13</label>
<caption>
<p>Results of translation model adaptation by concatenation of training data (
<italic>T1</italic>
), linear interpolation of general-domain and domain-specific models (
<italic>T2</italic>
), and employing the independent models in log-linear combination (
<italic>T3</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Direction</th>
<th align="left" rowspan="2">Test</th>
<th align="left">Base (
<italic>P1</italic>
)</th>
<th align="left" colspan="2">Concatenation (
<italic>T1</italic>
)</th>
<th align="left" colspan="2">Lin. interpol. (
<italic>T2</italic>
)</th>
<th align="left" colspan="2">Log-lin. comb. (
<italic>T3</italic>
)</th>
</tr>
<tr>
<th align="left">BLEU</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq93">
<alternatives>
<tex-math id="M173">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M174">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq93.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq94">
<alternatives>
<tex-math id="M175">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M176">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq94.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq95">
<alternatives>
<tex-math id="M177">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M178">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq95.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">English–French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">37.51</td>
<td char="." align="char">
<bold>39.61</bold>
</td>
<td char="." align="char">
<bold>2.10</bold>
</td>
<td char="." align="char">
<bold>39.85</bold>
</td>
<td char="." align="char">
<bold>2.34</bold>
</td>
<td char="." align="char">
<bold>39.76</bold>
</td>
<td char="." align="char">
<bold>2.25</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">32.15</td>
<td char="." align="char">41.33</td>
<td char="." align="char">9.18</td>
<td char="." align="char">42.08</td>
<td char="." align="char">9.93</td>
<td char="." align="char">
<bold>42.70</bold>
</td>
<td char="." align="char">
<bold>10.55</bold>
</td>
</tr>
<tr>
<td align="left">French–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">39.05</td>
<td char="." align="char">41.08</td>
<td char="." align="char">2.03</td>
<td char="." align="char">
<bold>41.92</bold>
</td>
<td char="." align="char">
<bold>2.87</bold>
</td>
<td char="." align="char">41.65</td>
<td char="." align="char">2.60</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">33.48</td>
<td char="." align="char">43.54</td>
<td char="." align="char">10.06</td>
<td char="." align="char">
<bold>45.06</bold>
</td>
<td char="." align="char">
<bold>11.58</bold>
</td>
<td char="." align="char">
<bold>45.12</bold>
</td>
<td char="." align="char">
<bold>11.64</bold>
</td>
</tr>
<tr>
<td align="left">English–Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">27.56</td>
<td char="." align="char">30.73</td>
<td char="." align="char">3.17</td>
<td char="." align="char">30.74</td>
<td char="." align="char">3.18</td>
<td char="." align="char">
<bold>31.89</bold>
</td>
<td char="." align="char">
<bold>4.33</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">30.07</td>
<td char="." align="char">
<bold>30.48</bold>
</td>
<td char="." align="char">
<bold>0.41</bold>
</td>
<td char="." align="char">
<bold>30.51</bold>
</td>
<td char="." align="char">
<bold>0.44</bold>
</td>
<td char="." align="char">
<bold>30.51</bold>
</td>
<td char="." align="char">
<bold>0.44</bold>
</td>
</tr>
<tr>
<td align="left">Greek–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">34.31</td>
<td char="." align="char">38.35</td>
<td char="." align="char">4.04</td>
<td char="." align="char">38.12</td>
<td char="." align="char">3.81</td>
<td char="." align="char">
<bold>38.66</bold>
</td>
<td char="." align="char">
<bold>4.35</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">37.57</td>
<td char="." align="char">
<bold>38.07</bold>
</td>
<td char="." align="char">
<bold>0.50</bold>
</td>
<td char="." align="char">
<bold>38.20</bold>
</td>
<td char="." align="char">
<bold>0.63</bold>
</td>
<td char="." align="char">37.90</td>
<td char="." align="char">0.33</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">3.94</td>
<td char="." align="char"></td>
<td char="." align="char">4.35</td>
<td char="." align="char"></td>
<td char="." align="char">4.56</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq92">
<alternatives>
<tex-math id="M179">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M180">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq92.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement in BLEU over
<italic>P1</italic>
trained on the general domain and tuned for specific domains</p>
</table-wrap-foot>
</table-wrap>
</p>
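<p>As an illustration of the second approach, the sketch below linearly interpolates two toy phrase tables feature by feature, assuming Moses-style entries with four probability features per phrase pair (direct and inverse phrase translation and lexical weighting probabilities). The phrase pairs and the four interpolation coefficients are invented for the example; Sennrich (2012) tunes such coefficients by minimising perplexity on development data, which is the method used for the <italic>T2</italic> systems.</p>
<preformat>
# Toy phrase tables: (source, target) -> four probability features.
general_pt = {
    ("working time", "temps de travail"): (0.60, 0.50, 0.70, 0.60),
    ("the", "le"):                        (0.40, 0.35, 0.45, 0.40),
}
domain_pt = {
    ("working time", "temps de travail"): (0.90, 0.80, 0.90, 0.80),
    ("parental leave", "congé parental"): (0.70, 0.60, 0.80, 0.70),
}

# One interpolation coefficient per feature (assumed values; tuned by
# perplexity minimisation in the actual experiments).
lams = (0.7, 0.7, 0.6, 0.6)

def interpolate(gen, dom, lams):
    merged = {}
    for pair in set(gen) | set(dom):
        g = gen.get(pair, (0.0, 0.0, 0.0, 0.0))  # unseen pairs get zero mass
        d = dom.get(pair, (0.0, 0.0, 0.0, 0.0))
        merged[pair] = tuple(l * dv + (1.0 - l) * gv
                             for l, gv, dv in zip(lams, g, d))
    return merged

# Note how the domain-specific pair "parental leave" enters the merged
# table, which a language model alone could never achieve.
for pair, feats in sorted(interpolate(general_pt, domain_pt, lams).items()):
    print(pair, ["%.3f" % f for f in feats])
</preformat>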
<p>We test all the alternative approaches, which are realised as systems
<italic>T1</italic>
(single translation model trained on a concatenation of data),
<italic>T2</italic>
(linear interpolation of the two translation models),
<italic>T3</italic>
(two independent translation models in log-linear combination), and compared with the in-domain-tuned systems (
<italic>P1</italic>
) in Table 
<xref rid="Tab13" ref-type="table">13</xref>
. We again observe substantial improvements in translation quality in all scenarios. However, there is no clear winner in this case: although the two more advanced methods (systems
<italic>T2</italic>
and
<italic>T3</italic>
) outperform the trivial one (system
<italic>T1</italic>
), the difference between the two is marginal. The average increase in BLEU for
<italic>T2</italic>
over
<italic>P1</italic>
is 4.35 points absolute (13.11 % relative) and for
<italic>T3</italic>
over
<italic>P1</italic>
4.56 points absolute (13.87 % relative). In three of the eight scenarios, the difference is not statistically significant,
<italic>T2</italic>
is significantly better in two scenarios, and
<italic>T3</italic>
is better in three scenarios (see Table 
<xref rid="Tab13" ref-type="table">13</xref>
).</p>
<p>The most substantial gain obtained by exploiting the domain-specific parallel training data is observed for the EN–FR language pair (in both translation directions) and the
<italic>lab</italic>
domain, where BLEU scores increase by 10.55–11.64 points absolute (for system
<italic>T3</italic>
), while in other scenarios the increase in BLEU is only between 0.33 and 4.35 points absolute. This can be explained by a better match between the training and test data, which is evident from the decrease in perplexity of the reference translations given the target language models, as discussed in the following section. It is likely caused by the size of the in-domain parallel training data for this language pair and domain, which is more than twice as large as the EN–FR
<italic>env</italic>
data and more than three times larger than the EN–EL data, both for the
<italic>env</italic>
and
<italic>lab</italic>
domains (see Table 
<xref rid="Tab3" ref-type="table">3</xref>
).</p>
<p>In further experiments, we test the techniques for translation model adaptation in systems with language models adapted by linear interpolation, which proved to be the most effective method for language model adaptation. Overall, the results presented in Table
<xref rid="Tab14" ref-type="table">14</xref>
are very positive: the improvements obtained by translation model adaptation are to a large extent preserved even when this method is applied together with language model adaptation. While linear interpolation of translation models realised in systems
<italic>T2</italic>
increases BLEU by 4.35 points absolute (
<italic>T2</italic>
over
<italic>P1</italic>
, see Table
<xref rid="Tab13" ref-type="table">13</xref>
), the same technique adds an additional 3.78 BLEU points when applied together with linear interpolation of language models (
<italic>C2</italic>
over
<italic>L2</italic>
, see Table
<xref rid="Tab14" ref-type="table">14</xref>
). The effects of using in-domain monolingual and parallel data are largely independent and do not cancel out when the two types of resources are used at the same time. On average, linear interpolation outperforms the other two techniques (
<italic>C1</italic>
and
<italic>C3</italic>
), but in most scenarios the difference is not statistically significant (cf. the bold figures in Table
<xref rid="Tab14" ref-type="table">14</xref>
).
<table-wrap id="Tab14">
<label>Table 14</label>
<caption>
<p>Results of complete adaptation. Language models in all systems are adapted by linear interpolation; translation models are adapted by concatenation of training data (
<italic>C1</italic>
), linear interpolation of general-domain and domain-specific models (
<italic>C2</italic>
), and employing the independent models in log-linear combination (
<italic>C3</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Direction</th>
<th align="left" rowspan="2">Test</th>
<th align="left">Base (
<italic>L2</italic>
)</th>
<th align="left" colspan="2">Concatenation (
<italic>C1</italic>
)</th>
<th align="left" colspan="2">Lin. interpol. (
<italic>C2</italic>
)</th>
<th align="left" colspan="2">Log-lin. comb. (
<italic>C3</italic>
)</th>
</tr>
<tr>
<th align="left">BLEU</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq97">
<alternatives>
<tex-math id="M181">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M182">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq97.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq98">
<alternatives>
<tex-math id="M183">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M184">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq98.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq99">
<alternatives>
<tex-math id="M185">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M186">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq99.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">English–French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">41.78</td>
<td char="." align="char">
<bold>43.70</bold>
</td>
<td char="." align="char">
<bold>1.92</bold>
</td>
<td char="." align="char">
<bold>43.85</bold>
</td>
<td char="." align="char">
<bold>2.07</bold>
</td>
<td char="." align="char">
<bold>43.75</bold>
</td>
<td char="." align="char">
<bold>1.97</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">38.54</td>
<td char="." align="char">47.45</td>
<td char="." align="char">8.91</td>
<td char="." align="char">
<bold>48.31</bold>
</td>
<td char="." align="char">
<bold>9.77</bold>
</td>
<td char="." align="char">47.96</td>
<td char="." align="char">9.42</td>
</tr>
<tr>
<td align="left">French–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">42.63</td>
<td char="." align="char">43.93</td>
<td char="." align="char">1.30</td>
<td char="." align="char">
<bold>44.22</bold>
</td>
<td char="." align="char">
<bold>1.59</bold>
</td>
<td char="." align="char">
<bold>44.12</bold>
</td>
<td char="." align="char">
<bold>1.49</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">41.11</td>
<td char="." align="char">50.07</td>
<td char="." align="char">8.96</td>
<td char="." align="char">
<bold>50.56</bold>
</td>
<td char="." align="char">
<bold>9.45</bold>
</td>
<td char="." align="char">
<bold>50.34</bold>
</td>
<td char="." align="char">
<bold>9.23</bold>
</td>
</tr>
<tr>
<td align="left">English–Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">34.89</td>
<td char="." align="char">
<bold>38.41</bold>
</td>
<td char="." align="char">
<bold>3.52</bold>
</td>
<td char="." align="char">37.90</td>
<td char="." align="char">3.01</td>
<td char="." align="char">
<bold>38.22</bold>
</td>
<td char="." align="char">
<bold>3.33</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">34.15</td>
<td char="." align="char">34.29</td>
<td char="." align="char">0.14</td>
<td char="." align="char">
<bold>34.76</bold>
</td>
<td char="." align="char">
<bold>0.61</bold>
</td>
<td char="." align="char">
<bold>34.48</bold>
</td>
<td char="." align="char">
<bold>0.33</bold>
</td>
</tr>
<tr>
<td align="left">Greek–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">37.57</td>
<td char="." align="char">
<bold>40.85</bold>
</td>
<td char="." align="char">
<bold>3.28</bold>
</td>
<td char="." align="char">
<bold>40.64</bold>
</td>
<td char="." align="char">
<bold>3.07</bold>
</td>
<td char="." align="char">
<bold>40.81</bold>
</td>
<td char="." align="char">
<bold>3.24</bold>
</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">40.09</td>
<td char="." align="char">
<bold>40.69</bold>
</td>
<td char="." align="char">
<bold>0.60</bold>
</td>
<td char="." align="char">
<bold>40.75</bold>
</td>
<td char="." align="char">
<bold>0.66</bold>
</td>
<td char="." align="char">
<bold>40.62</bold>
</td>
<td char="." align="char">
<bold>0.53</bold>
</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">3.58</td>
<td char="." align="char"></td>
<td char="." align="char">3.78</td>
<td char="." align="char"></td>
<td char="." align="char">3.69</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq96">
<alternatives>
<tex-math id="M187">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M188">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq96.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement in BLEU over
<italic>L2</italic>
with translation models trained on general-domain data only</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="Sec25">
<title>Complete adaptation and result analysis</title>
<p>In this section, we summarise the main results achieved by incremental adaptation of the various components of a PB-SMT system and compare them with the original baseline systems trained and tuned on general-domain data only. The results are accompanied by further analysis of three factors: the OOV rate in test sentences, the perplexity of reference translations given the target language models, and the average phrase length in test translations.</p>
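<p>For reference, the OOV statistic reported below can be computed with a few lines of code. The sketch assumes the OOV rate is the percentage of test-set tokens that never occur in the training data; the toy corpora are invented for illustration.</p>
<preformat>
def oov_rate(train_sentences, test_sentences):
    # Percentage of test tokens absent from the training vocabulary.
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return 100.0 * oov / len(test_tokens)

train = ["the directive on working time", "employers and workers"]
test = ["the directive on parental leave"]
print("OOV rate: %.2f %%" % oov_rate(train, test))
</preformat>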
<p>The main results in terms of BLEU are presented in Table 
<xref rid="Tab15" ref-type="table">15</xref>
, with the detailed characteristics of the systems given in Table 
<xref rid="Tab16" ref-type="table">16</xref>
. On average, in-domain parameter tuning (
<italic>P1</italic>
) improves BLEU by 6.64 points absolute (24.82 % relative). The components of the log-linear combination do not change, so the OOV rate and perplexity remain the same, while the average phrase length drops from 2.63 to 2.04 words, i.e., by 22.5 %. The adapted language model (linear interpolation of general-domain and domain-specific models tuned on the target side of the development data, systems
<italic>L2</italic>
) increases the gain in BLEU to 11.52 points absolute (43.73 % relative). The perplexity of the reference translations given the target language models drops by 45.4 % on average, and the average phrase length decreases to 1.87 words. The language model matches the test domain better and helps select better translation hypotheses, which consist of even more (and shorter, possibly reordered) phrases.</p>
<p>Finally, adaptation of the translation model (using linear interpolation of general-domain and domain-specific models tuned on the development data, systems
<italic>C2</italic>
) boosts the average improvement in BLEU to 15.30 points absolute (58.37 % relative). This step introduces new phrase pairs into the translation model and decreases the OOV rate. Compared to the baseline (
<italic>B0</italic>
), OOV drops by 30 % on average. In some scenarios (the EN–FR translation in the
<italic>lab</italic>
domain), OOV decreases by as much as 50 %, which is a sign of a better match between the test and training data. The target side of the parallel data also improves the language models, with their perplexity falling by an average of 67.5 % relative. The new in-domain material in the translation models also leads to longer phrases being used in the best-scored translation hypotheses. The average phrase length increased compared to the systems with adapted language models only (
<italic>L2</italic>
) by almost 20 % to 2.18 words.
<table-wrap id="Tab15">
<label>Table 15</label>
<caption>
<p>Incremental adaptation using various types of domain-specific resources: parallel data for parameter tuning (
<italic>P1</italic>
), monolingual data for improving the language models (
<italic>L2</italic>
), and parallel data for improving the translation model (
<italic>C2</italic>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Direction</th>
<th align="left" rowspan="2">Test</th>
<th align="left">Base (
<italic>B0</italic>
)</th>
<th align="left" colspan="2">+Tuning (
<italic>P1</italic>
)</th>
<th align="left" colspan="2">+Lang. model (
<italic>L2</italic>
)</th>
<th align="left" colspan="2">+Transl. model (
<italic>C2</italic>
)</th>
<th align="left" colspan="2">Spec. only (
<italic>C0</italic>
)</th>
</tr>
<tr>
<th align="left">BLEU</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq101">
<alternatives>
<tex-math id="M189">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M190">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq101.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq102">
<alternatives>
<tex-math id="M191">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M192">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq102.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq103">
<alternatives>
<tex-math id="M193">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M194">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq103.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq104">
<alternatives>
<tex-math id="M195">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M196">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq104.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="2">English–French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">29.61</td>
<td char="." align="char">37.51</td>
<td char="." align="char">7.90</td>
<td char="." align="char">41.78</td>
<td char="." align="char">12.17</td>
<td char="." align="char">39.85</td>
<td char="." align="char">
<bold>14.24</bold>
</td>
<td char="." align="char">39.54</td>
<td char="." align="char">9.93</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">23.94</td>
<td char="." align="char">32.15</td>
<td char="." align="char">8.21</td>
<td char="." align="char">38.54</td>
<td char="." align="char">14.60</td>
<td char="." align="char">42.08</td>
<td char="." align="char">
<bold>24.37</bold>
</td>
<td char="." align="char">43.05</td>
<td char="." align="char">19.11</td>
</tr>
<tr>
<td align="left" rowspan="2">French–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">31.79</td>
<td char="." align="char">39.05</td>
<td char="." align="char">7.26</td>
<td char="." align="char">42.63</td>
<td char="." align="char">10.84</td>
<td char="." align="char">41.92</td>
<td char="." align="char">
<bold>12.43</bold>
</td>
<td char="." align="char">37.86</td>
<td char="." align="char">6.07</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">26.96</td>
<td char="." align="char">33.48</td>
<td char="." align="char">6.52</td>
<td char="." align="char">41.11</td>
<td char="." align="char">14.15</td>
<td char="." align="char">45.06</td>
<td char="." align="char">
<bold>23.60</bold>
</td>
<td char="." align="char">43.74</td>
<td char="." align="char">16.78</td>
</tr>
<tr>
<td align="left" rowspan="2">English–Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">21.20</td>
<td char="." align="char">27.56</td>
<td char="." align="char">6.36</td>
<td char="." align="char">34.89</td>
<td char="." align="char">13.69</td>
<td char="." align="char">30.74</td>
<td char="." align="char">
<bold>16.70</bold>
</td>
<td char="." align="char">29.84</td>
<td char="." align="char">8.64</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">24.04</td>
<td char="." align="char">30.07</td>
<td char="." align="char">6.03</td>
<td char="." align="char">34.15</td>
<td char="." align="char">10.11</td>
<td char="." align="char">30.51</td>
<td char="." align="char">
<bold>10.72</bold>
</td>
<td char="." align="char">26.19</td>
<td char="." align="char">2.15</td>
</tr>
<tr>
<td align="left" rowspan="2">Greek–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">29.31</td>
<td char="." align="char">34.31</td>
<td char="." align="char">5.00</td>
<td char="." align="char">37.57</td>
<td char="." align="char">8.26</td>
<td char="." align="char">38.12</td>
<td char="." align="char">
<bold>11.33</bold>
</td>
<td char="." align="char">30.71</td>
<td char="." align="char">1.40</td>
</tr>
<tr>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">31.73</td>
<td char="." align="char">37.57</td>
<td char="." align="char">5.84</td>
<td char="." align="char">40.09</td>
<td char="." align="char">8.36</td>
<td char="." align="char">38.20</td>
<td char="." align="char">
<bold>9.02</bold>
</td>
<td char="." align="char">29.54</td>
<td char="." align="char">−2.19</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char"></td>
<td char="." align="char"></td>
<td char="." align="char">6.64</td>
<td char="." align="char"></td>
<td char="." align="char">11.52</td>
<td char="." align="char"></td>
<td char="." align="char">15.30</td>
<td char="." align="char"></td>
<td char="." align="char">7.74</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq100">
<alternatives>
<tex-math id="M197">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M198">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq100.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement in BLEU over the baseline general-domain system (
<italic>B0</italic>
)</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab16">
<label>Table 16</label>
<caption>
<p>Out-of-vocabulary rate (%) in the test sentences (OOV), perplexity of the reference translations given the target language models (PPL), and average phrase length in the test set translations (APL)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" rowspan="2">Direction</th>
<th align="left" rowspan="2">Test</th>
<th align="left" colspan="3">Base (
<italic>B0</italic>
)</th>
<th align="left" colspan="3">+Tuning (
<italic>P1</italic>
)</th>
<th align="left" colspan="3">+Lang. model (
<italic>L2</italic>
)</th>
<th align="left" colspan="3">+Transl. model (
<italic>C2</italic>
)</th>
</tr>
<tr>
<th align="left">OOV</th>
<th align="left">PPL</th>
<th align="left">APL</th>
<th align="left">OOV</th>
<th align="left">PPL</th>
<th align="left">APL</th>
<th align="left">OOV</th>
<th align="left">PPL</th>
<th align="left">APL</th>
<th align="left">OOV</th>
<th align="left">PPL</th>
<th align="left">APL</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">English–French</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">0.98</td>
<td char="." align="char">67.8</td>
<td char="." align="char">3.00</td>
<td char="." align="char">0.98</td>
<td char="." align="char">67.8</td>
<td char="." align="char">2.16</td>
<td char="." align="char">0.98</td>
<td char="." align="char">36.7</td>
<td char="." align="char">2.18</td>
<td char="." align="char">0.65</td>
<td char="." align="char">33.3</td>
<td char="." align="char">2.60</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">0.85</td>
<td char="." align="char">83.2</td>
<td char="." align="char">2.82</td>
<td char="." align="char">0.85</td>
<td char="." align="char">83.2</td>
<td char="." align="char">2.05</td>
<td char="." align="char">0.85</td>
<td char="." align="char">40.9</td>
<td char="." align="char">1.91</td>
<td char="." align="char">0.48</td>
<td char="." align="char">29.2</td>
<td char="." align="char">2.70</td>
</tr>
<tr>
<td align="left">French–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">0.81</td>
<td char="." align="char">122.9</td>
<td char="." align="char">2.49</td>
<td char="." align="char">0.81</td>
<td char="." align="char">122.9</td>
<td char="." align="char">1.77</td>
<td char="." align="char">0.81</td>
<td char="." align="char">80.9</td>
<td char="." align="char">1.75</td>
<td char="." align="char">0.54</td>
<td char="." align="char">68.3</td>
<td char="." align="char">2.15</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">0.68</td>
<td char="." align="char">153.6</td>
<td char="." align="char">2.45</td>
<td char="." align="char">0.68</td>
<td char="." align="char">153.6</td>
<td char="." align="char">1.83</td>
<td char="." align="char">0.68</td>
<td char="." align="char">59.5</td>
<td char="." align="char">1.54</td>
<td char="." align="char">0.38</td>
<td char="." align="char">40.2</td>
<td char="." align="char">1.81</td>
</tr>
<tr>
<td align="left">English–Greek</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">1.15</td>
<td char="." align="char">119.7</td>
<td char="." align="char">2.69</td>
<td char="." align="char">1.15</td>
<td char="." align="char">119.7</td>
<td char="." align="char">2.17</td>
<td char="." align="char">1.15</td>
<td char="." align="char">50.6</td>
<td char="." align="char">1.85</td>
<td char="." align="char">0.82</td>
<td char="." align="char">43.8</td>
<td char="." align="char">2.09</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">0.47</td>
<td char="." align="char">82.1</td>
<td char="." align="char">2.97</td>
<td char="." align="char">0.47</td>
<td char="." align="char">82.1</td>
<td char="." align="char">2.46</td>
<td char="." align="char">0.47</td>
<td char="." align="char">50.4</td>
<td char="." align="char">1.86</td>
<td char="." align="char">0.40</td>
<td char="." align="char">49.1</td>
<td char="." align="char">2.10</td>
</tr>
<tr>
<td align="left">Greek–English</td>
<td align="left">
<italic>env</italic>
</td>
<td char="." align="char">1.53</td>
<td char="." align="char">115.4</td>
<td char="." align="char">2.18</td>
<td char="." align="char">1.53</td>
<td char="." align="char">115.4</td>
<td char="." align="char">1.54</td>
<td char="." align="char">1.53</td>
<td char="." align="char">76.3</td>
<td char="." align="char">1.66</td>
<td char="." align="char">1.20</td>
<td char="." align="char">72.2</td>
<td char="." align="char">2.25</td>
</tr>
<tr>
<td align="left"></td>
<td align="left">
<italic>lab</italic>
</td>
<td char="." align="char">0.69</td>
<td char="." align="char">74.9</td>
<td char="." align="char">2.43</td>
<td char="." align="char">0.69</td>
<td char="." align="char">74.9</td>
<td char="." align="char">2.30</td>
<td char="." align="char">0.69</td>
<td char="." align="char">53.2</td>
<td char="." align="char">2.16</td>
<td char="." align="char">0.62</td>
<td char="." align="char">52.5</td>
<td char="." align="char">1.78</td>
</tr>
<tr>
<td align="left">Average</td>
<td align="left"></td>
<td char="." align="char">0.90</td>
<td char="." align="char">102.5</td>
<td char="." align="char">2.63</td>
<td char="." align="char">0.90</td>
<td char="." align="char">102.5</td>
<td char="." align="char">2.04</td>
<td char="." align="char">0.90</td>
<td char="." align="char">56.0</td>
<td char="." align="char">1.87</td>
<td char="." align="char">0.64</td>
<td char="." align="char">48.6</td>
<td char="." align="char">2.18</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>For comparison, Table
<xref rid="Tab15" ref-type="table">15</xref>
also reports the results of systems trained and tuned solely on domain-specific data (
<italic>C0</italic>
), which illustrates the pure effect of such training data. With one exception (the EL–EN translation in the
<italic>lab</italic>
domain), these systems outperform the baseline (
<italic>B0</italic>
), but the benefit of also using general-domain data is evident in all scenarios. The average difference in BLEU between the fully adapted systems (
<italic>C2</italic>
) and the systems trained on specific data only (
<italic>C0</italic>
) is 7.56 points absolute.
<fig id="Fig6">
<label>Fig. 6</label>
<caption>
<p>Visualisation of model weights of the systems presented in Table 
<xref rid="Tab14" ref-type="table">14</xref>
(
<italic>env</italic>
domain only) based on general-domain data for training and tuning (
<italic>B0</italic>
), domain-specific parallel data for tuning (
<italic>P1</italic>
), additional monolingual data for language models (
<italic>L2</italic>
), and additional parallel data for the translation model (
<italic>C2</italic>
)</p>
</caption>
<graphic xlink:href="10579_2014_9282_Fig6_HTML" id="MO9"></graphic>
</fig>
</p>
<p>In Fig. 
<xref rid="Fig6" ref-type="fig">6</xref>
, we visualise the weight vectors of the four systems presented in this section for the
<italic>env</italic>
domain (the trends on the
<italic>lab</italic>
domain are the same). Compared to the baseline (
<italic>B0</italic>
), the in-domain tuned systems (
<italic>P1</italic>
) place less trust in the translation model and prefer hypotheses consisting of more phrases, which are shorter and more heavily reordered. The weight vectors of systems
<italic>L2</italic>
do not change much. A consistent increase, however, is observed for both the language model weight (
<inline-formula id="IEq113">
<alternatives>
<tex-math id="M199">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{8}$$\end{document}</tex-math>
<mml:math id="M200">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>8</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq113.gif"></inline-graphic>
</alternatives>
</inline-formula>
) and phrase penalty (
<inline-formula id="IEq114">
<alternatives>
<tex-math id="M201">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{13}$$\end{document}</tex-math>
<mml:math id="M202">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>13</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq114.gif"></inline-graphic>
</alternatives>
</inline-formula>
). This is natural, as the language models match the test domain better and the systems are better able to construct improved hypotheses consisting of even shorter phrases. The parameters of the fully adapted systems (
<italic>C2</italic>
) changed only slightly. A consistent change is observed for the phrase penalty (
<inline-formula id="IEq115">
<alternatives>
<tex-math id="M203">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$h_{13}$$\end{document}</tex-math>
<mml:math id="M204">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>13</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq115.gif"></inline-graphic>
</alternatives>
</inline-formula>
); in most cases it dropped, which is reflected in an increase in average phrase length in the test translations compared to systems
<italic>L2</italic>
(see Table 
<xref rid="Tab16" ref-type="table">16</xref>
).</p>
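<p>The role of the phrase-penalty weight can be illustrated with a small log-linear scoring sketch. The feature values, names, and weights below are invented; the sketch merely shows that a more negative weight on the phrase count shifts the preference towards hypotheses built from fewer, longer phrases, mirroring the drop in the phrase-penalty weight and the increase in average phrase length described above.</p>
<preformat>
# Log-linear scoring: score(e) = sum_i w_i * h_i(e), over two toy
# features standing in for the language model score and phrase penalty.
def score(weights, features):
    return sum(weights[name] * value for name, value in features.items())

# Two hypotheses for the same source sentence: one built from a few
# long phrases, one from many short phrases with a better LM fit.
few_long   = {"lm": -12.0, "phrase_count": 3.0}
many_short = {"lm": -10.5, "phrase_count": 6.0}

for w_pp in (-1.0, -0.2):  # strong vs. weak penalty on using many phrases
    w = {"lm": 1.0, "phrase_count": w_pp}
    best = max((few_long, many_short), key=lambda h: score(w, h))
    print("phrase-penalty weight %.1f prefers %d phrases"
          % (w_pp, int(best["phrase_count"])))
</preformat>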
</sec>
</sec>
<sec id="Sec26" sec-type="conclusions">
<title>Conclusions</title>
<p>In the first main part of the paper (Sect. 
<xref rid="Sec12" ref-type="sec">4</xref>
), we focused on a detailed exposition of the pipeline for acquisition of monolingual and parallel data from the World Wide Web. In the second part (Sects. 
<xref rid="Sec21" ref-type="sec">5</xref>
<xref rid="Sec22" ref-type="sec">6</xref>
), we presented a thorough investigation of the impact of the resources that can be generated using the pipeline, focusing in particular on the major established approaches to domain adaptation of MT. We discussed the effect of tuning and adaptation on SMT system weights, and analysed the learning curves of parameter tuning, OOV rates, the perplexity of the test data, and the phrase length in translations produced during the various stages of adaptation.</p>
<p>The pipeline for the acquisition of domain-specific monolingual and parallel texts from the web is based on existing open-source tools for web crawling, text normalisation and cleaning, language identification, duplicate removal, and parallel sentence extraction. It is implemented as easy-to-use web services ready to be employed in industrial scenarios. It requires only limited human intervention for constructing the domain definition and the list of seed URLs, both of which can easily be tweaked and tuned to acquire texts with a high accuracy of 94 %. This pipeline was applied to acquire domain-specific resources for adaptation of a general-domain SMT system. We crawled monolingual and parallel data for two language pairs (English–French, English–Greek) and two domains (environment, labour legislation), which allowed us to perform a large-scale evaluation using a total of eight test scenarios. The acquired data sets are available from ELRA.</p>
<p>Our domain-adaptation experiments focused on the following three components of a PB-SMT model: the parameters of the log-linear combination and their optimisation, the language model, and the translation model. First, we confirmed the observation from previous research that systems trained and tuned on general-domain data perform poorly on specific domains. This finding is not surprising in itself, but the size of the loss and the consistency with which it was observed were rather unexpected. The average absolute decrease in BLEU across all the domain-specific evaluation scenarios was 21.82 points (37.86 % relative).</p>
<p>We confirm the results of previous research on tuning-only adaptation: tuning the general-domain-trained systems on data from the specific target domain recovers a significant amount of the loss. A few hundred sentence pairs used as development data improved the BLEU score of the baseline tuned on general-domain data by 6.64 points absolute (24.82 % relative) on average. A detailed analysis of the model parameters and of the phrase-length distribution in the test translations showed that a system trained and tuned on general-domain data strongly prefers translations composed of a few long phrases, and thus underperforms on specific domains where such phrases do not occur as frequently. In contrast, the same systems tuned on domain-specific data compose their output translations from shorter phrases, allow domain-specific reordering, and perform significantly and consistently better on specific domains.</p>
<p>Importantly, our findings show that the development data does not have to be manually cleaned and corrected, as parameter tuning on the development set (here, using MERT) is quite tolerant to imperfect translations and occasional noise in the development sets. Cross-domain tuning on a different development set also offers a good solution when no in-domain development data is available, especially when both domains differ from the general domain in a similar way. This step has the effect of tweaking the original general-domain system towards shorter phrases, and it matters little which particular development set is used.</p>
<p>The experiments with language model adaptation confirmed previous results. Linear interpolation of the general-domain and domain-specific models increased translation quality by a further 4.88 BLEU points absolute (14.95 % relative) on average, compared to the general-domain systems tuned on in-domain development sets, and significantly outperformed the other techniques (concatenation of training data and log-linear combination of the two models). Adaptation of the translation models (using the 7,000–20,000 acquired sentence pairs) increased BLEU scores by 4.56 points absolute (13.87 % relative) compared to the general-domain systems tuned on in-domain development sets; in this case, linear interpolation and log-linear combination produced similar results. In the combined approach, we observed that the effects of using in-domain monolingual and parallel data are largely independent and do not cancel out when the two types of resources are used at the same time. The final BLEU scores increased by 3.78 points absolute (9.66 % relative) with respect to the language-model-adapted systems, by 8.66 points absolute (26.43 % relative) with respect to the in-domain tuned systems, and by 15.30 points absolute (58.37 % relative) with respect to the general-domain baseline, all on average.</p>
<p>The pipeline for domain-focused web crawling described in this work proved very successful in the acquisition of domain-specific data, both monolingual and parallel. The experiments then showed the strong impact of the acquired resources on domain adaptation of MT. We mainly concentrated on parameter tuning and the analysis of its effects. Although we did not focus specifically on the adaptation of language and translation models, the acquired data also significantly improved these components and translation quality in general.</p>
</sec>
</body>
<back>
<app-group>
<app id="App1">
<sec id="Sec27">
<title>Appendix</title>
<sec id="Sec28">
<title>Domain definition: Environment</title>
<p>The environment domain refers to the interaction of humanity and the rest of the biophysical or natural environment. Relevant texts address the impacts of human activity on the natural environment, such as terrestrial, marine and atmospheric pollution, waste of natural resources (forests, mineral deposits, animal species) and climate change. Relevant texts also include laws, regulations and measures aiming to reduce the impacts of human activity on the natural environment and preserve ecosystems and biodiversity, which mainly refer to pollution control and remediation legislation, as well as to resource conservation and management. Texts on natural disasters and their effects on social life are also relevant.</p>
</sec>
<sec id="Sec29">
<title>Domain definition: Labour legislation</title>
<p>The labour legislation domain consists of laws, rules, and regulations which address the legal rights and obligations of workers and employers. Relevant texts refer to issues such as the determination of wages, working time, leave, working conditions, health and safety, as well as social security, retirement and compensation. The domain also covers the rights, obligations and actions of trade unions, as well as legal provisions concerning child labour, equality between men and women, and the employment of immigrants and disabled persons. Relevant texts also discuss measures aiming to increase employment and worker mobility, to combat unemployment, poverty and social exclusion, to promote equal opportunities, to avoid discrimination of any kind and to improve social protection systems.</p>
<p>
<table-wrap id="Tab17">
<label>Table 17</label>
<caption>
<p>Complete results of all English–French systems</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Domain</th>
<th align="left">System</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq118">
<alternatives>
<tex-math id="M205">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M206">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq118.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-PER</th>
<th align="left">
<inline-formula id="IEq119">
<alternatives>
<tex-math id="M207">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M208">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq119.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-TER</th>
<th align="left">
<inline-formula id="IEq120">
<alternatives>
<tex-math id="M209">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M210">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq120.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="15">
<italic>env</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">29.61</td>
<td char="." align="char">0.00</td>
<td char="." align="char">60.46</td>
<td char="." align="char">0.00</td>
<td char="." align="char">42.47</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">37.51</td>
<td char="." align="char">7.90</td>
<td char="." align="char">66.28</td>
<td char="." align="char">5.82</td>
<td char="." align="char">51.47</td>
<td char="." align="char">9.00</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">37.25</td>
<td char="." align="char">7.64</td>
<td char="." align="char">65.96</td>
<td char="." align="char">5.50</td>
<td char="." align="char">51.72</td>
<td char="." align="char">9.25</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">37.47</td>
<td char="." align="char">7.86</td>
<td char="." align="char">66.12</td>
<td char="." align="char">5.66</td>
<td char="." align="char">51.95</td>
<td char="." align="char">9.48</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">36.24</td>
<td char="." align="char">6.63</td>
<td char="." align="char">65.32</td>
<td char="." align="char">4.86</td>
<td char="." align="char">50.58</td>
<td char="." align="char">8.11</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">41.28</td>
<td char="." align="char">11.67</td>
<td char="." align="char">67.91</td>
<td char="." align="char">7.45</td>
<td char="." align="char">54.13</td>
<td char="." align="char">11.66</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">41.78</td>
<td char="." align="char">12.17</td>
<td char="." align="char">68.23</td>
<td char="." align="char">7.77</td>
<td char="." align="char">54.34</td>
<td char="." align="char">11.87</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">41.25</td>
<td char="." align="char">11.64</td>
<td char="." align="char">67.99</td>
<td char="." align="char">7.53</td>
<td char="." align="char">54.04</td>
<td char="." align="char">11.57</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">39.61</td>
<td char="." align="char">10.00</td>
<td char="." align="char">67.40</td>
<td char="." align="char">6.94</td>
<td char="." align="char">53.21</td>
<td char="." align="char">10.74</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">39.85</td>
<td char="." align="char">10.24</td>
<td char="." align="char">67.67</td>
<td char="." align="char">7.21</td>
<td char="." align="char">53.38</td>
<td char="." align="char">10.91</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">39.76</td>
<td char="." align="char">10.15</td>
<td char="." align="char">67.58</td>
<td char="." align="char">7.12</td>
<td char="." align="char">53.34</td>
<td char="." align="char">10.87</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">43.70</td>
<td char="." align="char">14.09</td>
<td char="." align="char">69.14</td>
<td char="." align="char">8.68</td>
<td char="." align="char">55.96</td>
<td char="." align="char">13.49</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">
<bold>43.85</bold>
</td>
<td char="." align="char">
<bold>14.24</bold>
</td>
<td char="." align="char">
<bold>69.52</bold>
</td>
<td char="." align="char">
<bold>9.06</bold>
</td>
<td char="." align="char">
<bold>56.12</bold>
</td>
<td char="." align="char">
<bold>13.65</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">43.75</td>
<td char="." align="char">14.14</td>
<td char="." align="char">69.49</td>
<td char="." align="char">9.03</td>
<td char="." align="char">55.78</td>
<td char="." align="char">13.31</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">39.54</td>
<td char="." align="char">9.93</td>
<td char="." align="char">66.72</td>
<td char="." align="char">6.26</td>
<td char="." align="char">52.15</td>
<td char="." align="char">9.68</td>
</tr>
<tr>
<td align="left" rowspan="15">
<italic>lab</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">23.94</td>
<td char="." align="char">0.00</td>
<td char="." align="char">57.15</td>
<td char="." align="char">0.00</td>
<td char="." align="char">36.40</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">32.15</td>
<td char="." align="char">8.21</td>
<td char="." align="char">62.59</td>
<td char="." align="char">5.44</td>
<td char="." align="char">46.87</td>
<td char="." align="char">10.47</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">31.88</td>
<td char="." align="char">7.94</td>
<td char="." align="char">62.50</td>
<td char="." align="char">5.35</td>
<td char="." align="char">46.03</td>
<td char="." align="char">9.63</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">31.82</td>
<td char="." align="char">7.88</td>
<td char="." align="char">62.47</td>
<td char="." align="char">5.32</td>
<td char="." align="char">46.03</td>
<td char="." align="char">9.63</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">30.60</td>
<td char="." align="char">6.66</td>
<td char="." align="char">61.54</td>
<td char="." align="char">4.39</td>
<td char="." align="char">45.14</td>
<td char="." align="char">8.74</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">36.15</td>
<td char="." align="char">12.21</td>
<td char="." align="char">64.73</td>
<td char="." align="char">7.58</td>
<td char="." align="char">48.83</td>
<td char="." align="char">12.43</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">38.54</td>
<td char="." align="char">14.60</td>
<td char="." align="char">66.01</td>
<td char="." align="char">8.86</td>
<td char="." align="char">50.70</td>
<td char="." align="char">14.30</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">35.54</td>
<td char="." align="char">11.60</td>
<td char="." align="char">64.63</td>
<td char="." align="char">7.48</td>
<td char="." align="char">48.40</td>
<td char="." align="char">12.00</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">41.33</td>
<td char="." align="char">17.39</td>
<td char="." align="char">67.77</td>
<td char="." align="char">10.62</td>
<td char="." align="char">53.07</td>
<td char="." align="char">16.67</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">42.08</td>
<td char="." align="char">18.14</td>
<td char="." align="char">68.82</td>
<td char="." align="char">11.67</td>
<td char="." align="char">53.71</td>
<td char="." align="char">17.31</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">42.70</td>
<td char="." align="char">18.76</td>
<td char="." align="char">69.12</td>
<td char="." align="char">11.97</td>
<td char="." align="char">54.06</td>
<td char="." align="char">17.66</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">47.45</td>
<td char="." align="char">23.51</td>
<td char="." align="char">71.45</td>
<td char="." align="char">14.30</td>
<td char="." align="char">57.97</td>
<td char="." align="char">21.57</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">
<bold>48.31</bold>
</td>
<td char="." align="char">
<bold>24.37</bold>
</td>
<td char="." align="char">
<bold>71.94</bold>
</td>
<td char="." align="char">
<bold>14.79</bold>
</td>
<td char="." align="char">
<bold>58.89</bold>
</td>
<td char="." align="char">
<bold>22.49</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">47.96</td>
<td char="." align="char">24.02</td>
<td char="." align="char">71.64</td>
<td char="." align="char">14.49</td>
<td char="." align="char">58.57</td>
<td char="." align="char">22.17</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">43.05</td>
<td char="." align="char">19.11</td>
<td char="." align="char">69.14</td>
<td char="." align="char">11.99</td>
<td char="." align="char">54.63</td>
<td char="." align="char">18.23</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq117">
<alternatives>
<tex-math id="M211">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M212">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq117.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement over the baseline (
<italic>B0</italic>
)</p>
</table-wrap-foot>
</table-wrap>
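To read the Δ columns of Table 17: system <italic>P1</italic> reaches 37.51 BLEU on the environment domain against 29.61 for the baseline <italic>B0</italic>, so its reported improvement is Δ = 37.51 − 29.61 = 7.90 BLEU points; the Δ values for 1-PER and 1-TER are computed in the same way.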
<table-wrap id="Tab18">
<label>Table 18</label>
<caption>
<p>Complete results of all French–English systems</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Domain</th>
<th align="left">System</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq122">
<alternatives>
<tex-math id="M213">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M214">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq122.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-PER</th>
<th align="left">
<inline-formula id="IEq123">
<alternatives>
<tex-math id="M215">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M216">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq123.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-TER</th>
<th align="left">
<inline-formula id="IEq124">
<alternatives>
<tex-math id="M217">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M218">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq124.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="15">
<italic>env</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">31.79</td>
<td char="." align="char">0.00</td>
<td char="." align="char">63.13</td>
<td char="." align="char">0.00</td>
<td char="." align="char">47.79</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">39.05</td>
<td char="." align="char">7.26</td>
<td char="." align="char">70.60</td>
<td char="." align="char">7.47</td>
<td char="." align="char">55.64</td>
<td char="." align="char">7.85</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">38.93</td>
<td char="." align="char">7.14</td>
<td char="." align="char">70.55</td>
<td char="." align="char">7.42</td>
<td char="." align="char">55.55</td>
<td char="." align="char">7.76</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">38.79</td>
<td char="." align="char">7.00</td>
<td char="." align="char">69.40</td>
<td char="." align="char">6.27</td>
<td char="." align="char">55.21</td>
<td char="." align="char">7.42</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">34.05</td>
<td char="." align="char">2.26</td>
<td char="." align="char">59.36</td>
<td char="." align="char">−3.77</td>
<td char="." align="char">48.02</td>
<td char="." align="char">0.23</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">40.58</td>
<td char="." align="char">8.79</td>
<td char="." align="char">71.13</td>
<td char="." align="char">8.00</td>
<td char="." align="char">56.65</td>
<td char="." align="char">8.86</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">42.63</td>
<td char="." align="char">10.84</td>
<td char="." align="char">71.85</td>
<td char="." align="char">8.72</td>
<td char="." align="char">57.83</td>
<td char="." align="char">10.04</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">39.93</td>
<td char="." align="char">8.14</td>
<td char="." align="char">70.83</td>
<td char="." align="char">7.70</td>
<td char="." align="char">56.16</td>
<td char="." align="char">8.37</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">41.08</td>
<td char="." align="char">9.29</td>
<td char="." align="char">71.04</td>
<td char="." align="char">7.91</td>
<td char="." align="char">56.86</td>
<td char="." align="char">9.07</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">41.92</td>
<td char="." align="char">10.13</td>
<td char="." align="char">71.63</td>
<td char="." align="char">8.50</td>
<td char="." align="char">57.55</td>
<td char="." align="char">9.76</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">41.65</td>
<td char="." align="char">9.86</td>
<td char="." align="char">71.69</td>
<td char="." align="char">8.56</td>
<td char="." align="char">57.24</td>
<td char="." align="char">9.45</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">43.93</td>
<td char="." align="char">12.14</td>
<td char="." align="char">72.71</td>
<td char="." align="char">9.58</td>
<td char="." align="char">58.95</td>
<td char="." align="char">11.16</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">
<bold>44.22</bold>
</td>
<td char="." align="char">
<bold>12.43</bold>
</td>
<td char="." align="char">72.69</td>
<td char="." align="char">9.56</td>
<td char="." align="char">
<bold>59.00</bold>
</td>
<td char="." align="char">
<bold>11.21</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">44.12</td>
<td char="." align="char">12.33</td>
<td char="." align="char">
<bold>72.79</bold>
</td>
<td char="." align="char">
<bold>9.66</bold>
</td>
<td char="." align="char">58.98</td>
<td char="." align="char">11.19</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">37.86</td>
<td char="." align="char">6.07</td>
<td char="." align="char">68.71</td>
<td char="." align="char">5.58</td>
<td char="." align="char">53.58</td>
<td char="." align="char">5.79</td>
</tr>
<tr>
<td align="left" rowspan="15">
<italic>lab</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">26.96</td>
<td char="." align="char">0.00</td>
<td char="." align="char">59.94</td>
<td char="." align="char">0.00</td>
<td char="." align="char">43.04</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">33.48</td>
<td char="." align="char">6.52</td>
<td char="." align="char">66.60</td>
<td char="." align="char">6.66</td>
<td char="." align="char">50.41</td>
<td char="." align="char">7.37</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">33.34</td>
<td char="." align="char">6.38</td>
<td char="." align="char">66.55</td>
<td char="." align="char">6.61</td>
<td char="." align="char">50.33</td>
<td char="." align="char">7.29</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">33.07</td>
<td char="." align="char">6.11</td>
<td char="." align="char">66.58</td>
<td char="." align="char">6.64</td>
<td char="." align="char">50.90</td>
<td char="." align="char">7.86</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">29.69</td>
<td char="." align="char">2.73</td>
<td char="." align="char">56.85</td>
<td char="." align="char">−3.09</td>
<td char="." align="char">43.88</td>
<td char="." align="char">0.84</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">38.05</td>
<td char="." align="char">11.09</td>
<td char="." align="char">68.74</td>
<td char="." align="char">8.80</td>
<td char="." align="char">53.54</td>
<td char="." align="char">10.50</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">41.11</td>
<td char="." align="char">14.15</td>
<td char="." align="char">70.19</td>
<td char="." align="char">10.25</td>
<td char="." align="char">55.42</td>
<td char="." align="char">12.38</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">33.95</td>
<td char="." align="char">6.99</td>
<td char="." align="char">60.46</td>
<td char="." align="char">0.52</td>
<td char="." align="char">47.78</td>
<td char="." align="char">4.74</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">43.54</td>
<td char="." align="char">16.58</td>
<td char="." align="char">72.08</td>
<td char="." align="char">12.14</td>
<td char="." align="char">57.50</td>
<td char="." align="char">14.46</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">45.06</td>
<td char="." align="char">18.10</td>
<td char="." align="char">73.28</td>
<td char="." align="char">13.34</td>
<td char="." align="char">59.03</td>
<td char="." align="char">15.99</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">45.12</td>
<td char="." align="char">18.16</td>
<td char="." align="char">73.30</td>
<td char="." align="char">13.36</td>
<td char="." align="char">59.03</td>
<td char="." align="char">15.99</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">50.07</td>
<td char="." align="char">23.11</td>
<td char="." align="char">75.34</td>
<td char="." align="char">15.40</td>
<td char="." align="char">62.66</td>
<td char="." align="char">19.62</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">
<bold>50.56</bold>
</td>
<td char="." align="char">
<bold>23.60</bold>
</td>
<td char="." align="char">
<bold>75.71</bold>
</td>
<td char="." align="char">
<bold>15.77</bold>
</td>
<td char="." align="char">
<bold>63.13</bold>
</td>
<td char="." align="char">
<bold>20.09</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">50.34</td>
<td char="." align="char">23.38</td>
<td char="." align="char">
<bold>75.71</bold>
</td>
<td char="." align="char">
<bold>15.77</bold>
</td>
<td char="." align="char">63.10</td>
<td char="." align="char">20.06</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">43.74</td>
<td char="." align="char">16.78</td>
<td char="." align="char">72.07</td>
<td char="." align="char">12.13</td>
<td char="." align="char">57.75</td>
<td char="." align="char">14.71</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq121">
<alternatives>
<tex-math id="M219">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M220">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq121.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement over the baseline (
<italic>B0</italic>
)</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab19">
<label>Table 19</label>
<caption>
<p>Complete results of all English–Greek systems</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Domain</th>
<th align="left">System</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq126">
<alternatives>
<tex-math id="M221">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M222">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq126.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-PER</th>
<th align="left">
<inline-formula id="IEq127">
<alternatives>
<tex-math id="M223">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M224">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq127.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-TER</th>
<th align="left">
<inline-formula id="IEq128">
<alternatives>
<tex-math id="M225">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M226">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq128.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="15">
<italic>env</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">21.20</td>
<td char="." align="char">0.00</td>
<td char="." align="char">52.75</td>
<td char="." align="char">0.00</td>
<td char="." align="char">36.76</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">27.56</td>
<td char="." align="char">6.36</td>
<td char="." align="char">57.65</td>
<td char="." align="char">4.90</td>
<td char="." align="char">43.48</td>
<td char="." align="char">6.72</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">27.29</td>
<td char="." align="char">6.09</td>
<td char="." align="char">57.25</td>
<td char="." align="char">4.50</td>
<td char="." align="char">44.00</td>
<td char="." align="char">7.24</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">27.26</td>
<td char="." align="char">6.06</td>
<td char="." align="char">57.04</td>
<td char="." align="char">4.29</td>
<td char="." align="char">44.44</td>
<td char="." align="char">7.68</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">27.16</td>
<td char="." align="char">5.96</td>
<td char="." align="char">57.03</td>
<td char="." align="char">4.28</td>
<td char="." align="char">43.94</td>
<td char="." align="char">7.18</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">33.59</td>
<td char="." align="char">12.39</td>
<td char="." align="char">61.07</td>
<td char="." align="char">8.32</td>
<td char="." align="char">47.48</td>
<td char="." align="char">10.72</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">34.89</td>
<td char="." align="char">13.69</td>
<td char="." align="char">61.52</td>
<td char="." align="char">8.77</td>
<td char="." align="char">49.82</td>
<td char="." align="char">13.06</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">33.65</td>
<td char="." align="char">12.45</td>
<td char="." align="char">60.98</td>
<td char="." align="char">8.23</td>
<td char="." align="char">47.74</td>
<td char="." align="char">10.98</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">30.73</td>
<td char="." align="char">9.53</td>
<td char="." align="char">58.68</td>
<td char="." align="char">5.93</td>
<td char="." align="char">44.91</td>
<td char="." align="char">8.15</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">30.74</td>
<td char="." align="char">9.54</td>
<td char="." align="char">58.99</td>
<td char="." align="char">6.24</td>
<td char="." align="char">45.29</td>
<td char="." align="char">8.53</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">31.89</td>
<td char="." align="char">10.69</td>
<td char="." align="char">59.71</td>
<td char="." align="char">6.96</td>
<td char="." align="char">46.18</td>
<td char="." align="char">9.42</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">
<bold>38.41</bold>
</td>
<td char="." align="char">
<bold>17.21</bold>
</td>
<td char="." align="char">
<bold>63.99</bold>
</td>
<td char="." align="char">
<bold>11.24</bold>
</td>
<td char="." align="char">51.09</td>
<td char="." align="char">14.33</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">37.90</td>
<td char="." align="char">16.70</td>
<td char="." align="char">63.27</td>
<td char="." align="char">10.52</td>
<td char="." align="char">
<bold>51.61</bold>
</td>
<td char="." align="char">
<bold>14.85</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">38.22</td>
<td char="." align="char">17.02</td>
<td char="." align="char">63.89</td>
<td char="." align="char">11.14</td>
<td char="." align="char">51.28</td>
<td char="." align="char">14.52</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">29.84</td>
<td char="." align="char">8.64</td>
<td char="." align="char">57.15</td>
<td char="." align="char">4.40</td>
<td char="." align="char">42.89</td>
<td char="." align="char">6.13</td>
</tr>
<tr>
<td align="left" rowspan="15">
<italic>lab</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">24.04</td>
<td char="." align="char">0.00</td>
<td char="." align="char">53.69</td>
<td char="." align="char">0.00</td>
<td char="." align="char">38.79</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">30.07</td>
<td char="." align="char">6.03</td>
<td char="." align="char">59.66</td>
<td char="." align="char">5.97</td>
<td char="." align="char">46.17</td>
<td char="." align="char">7.38</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">30.23</td>
<td char="." align="char">6.19</td>
<td char="." align="char">59.67</td>
<td char="." align="char">5.98</td>
<td char="." align="char">46.15</td>
<td char="." align="char">7.36</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">29.68</td>
<td char="." align="char">5.64</td>
<td char="." align="char">57.71</td>
<td char="." align="char">4.02</td>
<td char="." align="char">44.95</td>
<td char="." align="char">6.16</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">29.76</td>
<td char="." align="char">5.72</td>
<td char="." align="char">58.73</td>
<td char="." align="char">5.04</td>
<td char="." align="char">45.59</td>
<td char="." align="char">6.80</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">35.09</td>
<td char="." align="char">11.05</td>
<td char="." align="char">62.35</td>
<td char="." align="char">8.66</td>
<td char="." align="char">49.55</td>
<td char="." align="char">10.76</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">34.15</td>
<td char="." align="char">10.11</td>
<td char="." align="char">61.90</td>
<td char="." align="char">8.21</td>
<td char="." align="char">48.74</td>
<td char="." align="char">9.95</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">34.33</td>
<td char="." align="char">10.29</td>
<td char="." align="char">61.95</td>
<td char="." align="char">8.26</td>
<td char="." align="char">48.78</td>
<td char="." align="char">9.99</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">30.48</td>
<td char="." align="char">6.44</td>
<td char="." align="char">60.11</td>
<td char="." align="char">6.42</td>
<td char="." align="char">46.79</td>
<td char="." align="char">8.00</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">30.51</td>
<td char="." align="char">6.47</td>
<td char="." align="char">60.19</td>
<td char="." align="char">6.50</td>
<td char="." align="char">46.63</td>
<td char="." align="char">7.84</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">30.51</td>
<td char="." align="char">6.47</td>
<td char="." align="char">59.99</td>
<td char="." align="char">6.30</td>
<td char="." align="char">46.43</td>
<td char="." align="char">7.64</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">34.29</td>
<td char="." align="char">10.25</td>
<td char="." align="char">62.03</td>
<td char="." align="char">8.34</td>
<td char="." align="char">49.08</td>
<td char="." align="char">10.29</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">
<bold>34.76</bold>
</td>
<td char="." align="char">
<bold>10.72</bold>
</td>
<td char="." align="char">
<bold>62.40</bold>
</td>
<td char="." align="char">
<bold>8.71</bold>
</td>
<td char="." align="char">
<bold>49.74</bold>
</td>
<td char="." align="char">
<bold>10.95</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">34.48</td>
<td char="." align="char">10.44</td>
<td char="." align="char">61.77</td>
<td char="." align="char">8.08</td>
<td char="." align="char">48.58</td>
<td char="." align="char">9.79</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">26.19</td>
<td char="." align="char">2.15</td>
<td char="." align="char">55.05</td>
<td char="." align="char">1.36</td>
<td char="." align="char">40.57</td>
<td char="." align="char">1.78</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq125">
<alternatives>
<tex-math id="M227">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M228">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq125.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement over the baseline (
<italic>B0</italic>
)</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab20">
<label>Table 20</label>
<caption>
<p>Complete results of all Greek–English systems</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Domain</th>
<th align="left">System</th>
<th align="left">BLEU</th>
<th align="left">
<inline-formula id="IEq130">
<alternatives>
<tex-math id="M229">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M230">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq130.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-PER</th>
<th align="left">
<inline-formula id="IEq131">
<alternatives>
<tex-math id="M231">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M232">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq131.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">1-TER</th>
<th align="left">
<inline-formula id="IEq132">
<alternatives>
<tex-math id="M233">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M234">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq132.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="15">
<italic>env</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">29.31</td>
<td char="." align="char">0.00</td>
<td char="." align="char">62.98</td>
<td char="." align="char">0.00</td>
<td char="." align="char">49.00</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">34.31</td>
<td char="." align="char">5.00</td>
<td char="." align="char">67.62</td>
<td char="." align="char">4.64</td>
<td char="." align="char">52.70</td>
<td char="." align="char">3.70</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">34.32</td>
<td char="." align="char">5.01</td>
<td char="." align="char">67.40</td>
<td char="." align="char">4.42</td>
<td char="." align="char">52.49</td>
<td char="." align="char">3.49</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">33.98</td>
<td char="." align="char">4.67</td>
<td char="." align="char">66.59</td>
<td char="." align="char">3.61</td>
<td char="." align="char">52.06</td>
<td char="." align="char">3.06</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">31.45</td>
<td char="." align="char">2.14</td>
<td char="." align="char">57.50</td>
<td char="." align="char">-5.48</td>
<td char="." align="char">46.29</td>
<td char="." align="char">−2.71</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">37.03</td>
<td char="." align="char">7.72</td>
<td char="." align="char">68.67</td>
<td char="." align="char">5.69</td>
<td char="." align="char">54.31</td>
<td char="." align="char">5.31</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">37.57</td>
<td char="." align="char">8.26</td>
<td char="." align="char">68.93</td>
<td char="." align="char">5.95</td>
<td char="." align="char">54.38</td>
<td char="." align="char">5.38</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">36.55</td>
<td char="." align="char">7.24</td>
<td char="." align="char">68.62</td>
<td char="." align="char">5.64</td>
<td char="." align="char">53.95</td>
<td char="." align="char">4.95</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">38.35</td>
<td char="." align="char">9.04</td>
<td char="." align="char">69.00</td>
<td char="." align="char">6.02</td>
<td char="." align="char">55.15</td>
<td char="." align="char">6.15</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">38.12</td>
<td char="." align="char">8.81</td>
<td char="." align="char">69.31</td>
<td char="." align="char">6.33</td>
<td char="." align="char">54.92</td>
<td char="." align="char">5.92</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">38.66</td>
<td char="." align="char">9.35</td>
<td char="." align="char">69.49</td>
<td char="." align="char">6.51</td>
<td char="." align="char">55.26</td>
<td char="." align="char">6.26</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">
<bold>40.85</bold>
</td>
<td char="." align="char">
<bold>11.54</bold>
</td>
<td char="." align="char">
<bold>70.59</bold>
</td>
<td char="." align="char">
<bold>7.61</bold>
</td>
<td char="." align="char">56.82</td>
<td char="." align="char">7.82</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">40.64</td>
<td char="." align="char">11.33</td>
<td char="." align="char">70.40</td>
<td char="." align="char">7.42</td>
<td char="." align="char">56.85</td>
<td char="." align="char">7.85</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">40.81</td>
<td char="." align="char">11.50</td>
<td char="." align="char">70.47</td>
<td char="." align="char">7.49</td>
<td char="." align="char">
<bold>56.97</bold>
</td>
<td char="." align="char">
<bold>7.97</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">30.71</td>
<td char="." align="char">1.40</td>
<td char="." align="char">62.76</td>
<td char="." align="char">−0.22</td>
<td char="." align="char">47.26</td>
<td char="." align="char">−1.74</td>
</tr>
<tr>
<td align="left" rowspan="15">
<italic>lab</italic>
</td>
<td align="left">
<italic>B0</italic>
</td>
<td char="." align="char">31.73</td>
<td char="." align="char">0.00</td>
<td char="." align="char">64.48</td>
<td char="." align="char">0.00</td>
<td char="." align="char">51.39</td>
<td char="." align="char">0.00</td>
</tr>
<tr>
<td align="left">
<italic>P1</italic>
</td>
<td char="." align="char">37.57</td>
<td char="." align="char">5.84</td>
<td char="." align="char">69.53</td>
<td char="." align="char">5.05</td>
<td char="." align="char">54.93</td>
<td char="." align="char">3.54</td>
</tr>
<tr>
<td align="left">
<italic>P2</italic>
</td>
<td char="." align="char">37.68</td>
<td char="." align="char">5.95</td>
<td char="." align="char">69.39</td>
<td char="." align="char">4.91</td>
<td char="." align="char">55.22</td>
<td char="." align="char">3.83</td>
</tr>
<tr>
<td align="left">
<italic>P3</italic>
</td>
<td char="." align="char">37.58</td>
<td char="." align="char">5.85</td>
<td char="." align="char">69.21</td>
<td char="." align="char">4.73</td>
<td char="." align="char">55.55</td>
<td char="." align="char">4.16</td>
</tr>
<tr>
<td align="left">
<italic>P4</italic>
</td>
<td char="." align="char">34.95</td>
<td char="." align="char">3.22</td>
<td char="." align="char">61.40</td>
<td char="." align="char">−3.08</td>
<td char="." align="char">50.20</td>
<td char="." align="char">−1.19</td>
</tr>
<tr>
<td align="left">
<italic>L1</italic>
</td>
<td char="." align="char">40.15</td>
<td char="." align="char">8.42</td>
<td char="." align="char">70.44</td>
<td char="." align="char">5.96</td>
<td char="." align="char">56.82</td>
<td char="." align="char">5.43</td>
</tr>
<tr>
<td align="left">
<italic>L2</italic>
</td>
<td char="." align="char">40.09</td>
<td char="." align="char">8.36</td>
<td char="." align="char">70.50</td>
<td char="." align="char">6.02</td>
<td char="." align="char">56.45</td>
<td char="." align="char">5.06</td>
</tr>
<tr>
<td align="left">
<italic>L3</italic>
</td>
<td char="." align="char">40.01</td>
<td char="." align="char">8.28</td>
<td char="." align="char">70.53</td>
<td char="." align="char">6.05</td>
<td char="." align="char">56.48</td>
<td char="." align="char">5.09</td>
</tr>
<tr>
<td align="left">
<italic>T1</italic>
</td>
<td char="." align="char">38.07</td>
<td char="." align="char">6.34</td>
<td char="." align="char">69.78</td>
<td char="." align="char">5.30</td>
<td char="." align="char">55.25</td>
<td char="." align="char">3.86</td>
</tr>
<tr>
<td align="left">
<italic>T2</italic>
</td>
<td char="." align="char">38.20</td>
<td char="." align="char">6.47</td>
<td char="." align="char">69.67</td>
<td char="." align="char">5.19</td>
<td char="." align="char">55.26</td>
<td char="." align="char">3.87</td>
</tr>
<tr>
<td align="left">
<italic>T3</italic>
</td>
<td char="." align="char">37.90</td>
<td char="." align="char">6.17</td>
<td char="." align="char">69.52</td>
<td char="." align="char">5.04</td>
<td char="." align="char">55.05</td>
<td char="." align="char">3.66</td>
</tr>
<tr>
<td align="left">
<italic>C1</italic>
</td>
<td char="." align="char">40.69</td>
<td char="." align="char">8.96</td>
<td char="." align="char">70.67</td>
<td char="." align="char">6.19</td>
<td char="." align="char">56.77</td>
<td char="." align="char">5.38</td>
</tr>
<tr>
<td align="left">
<italic>C2</italic>
</td>
<td char="." align="char">
<bold>40.75</bold>
</td>
<td char="." align="char">
<bold>9.02</bold>
</td>
<td char="." align="char">
<bold>70.89</bold>
</td>
<td char="." align="char">
<bold>6.41</bold>
</td>
<td char="." align="char">
<bold>57.04</bold>
</td>
<td char="." align="char">
<bold>5.65</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>C3</italic>
</td>
<td char="." align="char">40.62</td>
<td char="." align="char">8.89</td>
<td char="." align="char">70.76</td>
<td char="." align="char">6.28</td>
<td char="." align="char">56.91</td>
<td char="." align="char">5.52</td>
</tr>
<tr>
<td align="left">
<italic>C0</italic>
</td>
<td char="." align="char">29.54</td>
<td char="." align="char">−2.19</td>
<td char="." align="char">61.76</td>
<td char="." align="char">−2.72</td>
<td char="." align="char">46.52</td>
<td char="." align="char">−4.87</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>
<inline-formula id="IEq129">
<alternatives>
<tex-math id="M235">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\varDelta $$\end{document}</tex-math>
<mml:math id="M236">
<mml:mi mathvariant="italic">Δ</mml:mi>
</mml:math>
<inline-graphic xlink:href="10579_2014_9282_Article_IEq129.gif"></inline-graphic>
</alternatives>
</inline-formula>
refers to absolute improvement over the baseline (
<italic>B0</italic>
)</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
</sec>
</app>
</app-group>
<fn-group>
<fn id="Fn1">
<label>1</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://www.panacea-lr.eu/">http://www.panacea-lr.eu/</ext-link>
.</p>
</fn>
<fn id="Fn2">
<label>2</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://combine.it.lth.se/">http://combine.it.lth.se/</ext-link>
.</p>
</fn>
<fn id="Fn3">
<label>3</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://crawler.archive.org/">http://crawler.archive.org/</ext-link>
.</p>
</fn>
<fn id="Fn4">
<label>4</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://nutch.apache.org/">http://nutch.apache.org/</ext-link>
.</p>
</fn>
<fn id="Fn5">
<label>5</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://openbixo.org/">http://openbixo.org/</ext-link>
.</p>
</fn>
<fn id="Fn6">
<label>6</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://www.httrack.com/">http://www.httrack.com/</ext-link>
.</p>
</fn>
<fn id="Fn7">
<label>7</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://eurovoc.europa.eu/">http://eurovoc.europa.eu/</ext-link>
.</p>
</fn>
<fn id="Fn8">
<label>8</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://hadoop.apache.org/">http://hadoop.apache.org/</ext-link>
.</p>
</fn>
<fn id="Fn9">
<label>9</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://registry.elda.org/services/160">http://registry.elda.org/services/160</ext-link>
.</p>
</fn>
<fn id="Fn10">
<label>10</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://dmoz.org/">http://dmoz.org/</ext-link>
.</p>
</fn>
<fn id="Fn11">
<label>11</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://dir.yahoo.com/">http://dir.yahoo.com/</ext-link>
.</p>
</fn>
<fn id="Fn12">
<label>12</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://tika.apache.org/">http://tika.apache.org/</ext-link>
.</p>
</fn>
<fn id="Fn13">
<label>13</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://catalog.elra.info/">http://catalog.elra.info/</ext-link>
.</p>
</fn>
<fn id="Fn14">
<label>14</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://registry.elda.org/services/127">http://registry.elda.org/services/127</ext-link>
.</p>
</fn>
<fn id="Fn15">
<label>15</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://www.statmt.org/europarl/">http://www.statmt.org/europarl/</ext-link>
.</p>
</fn>
<fn id="Fn16">
<label>16</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://www.statmt.org/wpt05/">http://www.statmt.org/wpt05/</ext-link>
.</p>
</fn>
<fn id="Fn17">
<label>17</label>
<p>
<ext-link ext-link-type="uri" xlink:href="http://www.statmt.org/wmt11/">http://www.statmt.org/wmt11/</ext-link>
.</p>
</fn>
<fn id="Fn18">
<label>18</label>
<p>Note that there is no strict minimum number of development sentences, although we address this issue in Sect. 
<xref rid="Sec21" ref-type="sec">5.5</xref>
. The only requirement is that the optimisation procedure (MERT in our case) converges, which might not happen if the set is too small or unbalanced.</p>
</fn>
</fn-group>
<ack>
<p>This work has been supported by the 7th Framework Research Programme of the European Union projects PANACEA (Contract No. 248064), Khresmoi (Contract No. 257528), and Abu-MaTran (Contract No. 324414), the Czech Science Foundation (Grant No. P103/12/G084), and the Science Foundation Ireland (Grant No. 07/CE/I1142) as part of the Centre for Next Generation Localisation at Dublin City University. We would like to thank all the partners of the PANACEA project for their help and support, especially Victoria Arranz, Olivier Hamon, and Khalid Choukri from the Evaluations and Language Resources Distribution Agency (ELDA), Paris, France, who contributed to the manual correction of French–English parallel data, and Maria Giagkou and Voula Giouli from the Institute for Language and Speech Processing / Athena RIC, Athens, Greece for their help in the construction of the domain definitions and in the manual correction of Greek–English parallel data.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<element-citation publication-type="book">
<person-group person-group-type="author">
<name>
<surname>Ardö</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Golub</surname>
<given-names>K</given-names>
</name>
</person-group>
<source>Focused crawler software package</source>
<year>2007</year>
<publisher-loc>Sweden</publisher-loc>
<publisher-name>Tech. rep., Department of Information Technology, Lund University</publisher-name>
</element-citation>
</ref>
<ref id="CR2">
<mixed-citation publication-type="other">Axelrod, A., He, X., & Gao, J. (2011). Domain adaptation via pseudo in-domain data selection. In
<italic>Proceedings of the conference on empirical methods in natural language processing</italic>
. Edinburgh, United Kingdom, pp. 355–362.</mixed-citation>
</ref>
<ref id="CR3">
<mixed-citation publication-type="other">Banerjee, P., Du, J., Li, B., Naskar, S., Way, A., & van Genabith, J. (2010). Combining multi-domain statistical machine translation models using automatic classifiers. In
<italic>Proceedings of the ninth conference of the association for machine translation in the Americas</italic>
. Denver, Colorado, USA, pp. 141–150.</mixed-citation>
</ref>
<ref id="CR4">
<mixed-citation publication-type="other">Banerjee, P., Naskar, S.K., Roturier, J., Way, A., & van Genabith, J. (2011). Domain adaptation in statistical machine translation of user-forum data using component level mixture modelling. In
<italic>Proceedings of the machine translation summit XIII</italic>
. Xiamen, China, pp. 285–292.</mixed-citation>
</ref>
<ref id="CR5">
<mixed-citation publication-type="other">Banerjee, P., Rubino, R., Roturier, J., & van Genabith, J. (2013). Quality estimation-guided data selection for domain adaptation of smt. In
<italic>Proceedings of the XIV machine translation summit</italic>
. Nice, France, pp. 101–108.</mixed-citation>
</ref>
<ref id="CR6">
<mixed-citation publication-type="other">Barbosa, L., Rangarajan Sridhar, V.K., Yarmohammadi, M., & Bangalore, S. (2012). Harvesting parallel text in multiple languages with limited supervision. In
<italic>Proceedings of the 24th international conference on computational linguistics</italic>
. Mumbai, India, pp. 201–214.</mixed-citation>
</ref>
<ref id="CR7">
<mixed-citation publication-type="other">Baroni, M., Kilgarriff, A., Pomikálek, J., & Rychlý, P. (2006). WebBootCaT: Instant domain-specific corpora to support human translators. In
<italic>Proceedings of the 11th annual conference of the european association for machine translation</italic>
. Oslo, Norway, pp. 247–252.</mixed-citation>
</ref>
<ref id="CR8">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baroni</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Bernardini</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Ferraresi</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Zanchetta</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora</article-title>
<source>Language Resources and Evaluation</source>
<year>2009</year>
<volume>43</volume>
<issue>3</issue>
<fpage>209</fpage>
<lpage>226</lpage>
<pub-id pub-id-type="doi">10.1007/s10579-009-9081-4</pub-id>
</element-citation>
</ref>
<ref id="CR9">
<mixed-citation publication-type="other">Bergmark, D., Lagoze, C., & Sbityakov, A. (2002). Focused crawls, tunneling, and digital libraries. In M. Agosti & C. Thanos (Eds.),
<italic>Research and advanced technology for digital libraries, lecture notes in computer science</italic>
. Berlin, Heidelberg: Springer, Vol. 2458, pp. 49–70.</mixed-citation>
</ref>
<ref id="CR10">
<mixed-citation publication-type="other">Bertoldi, N., & Federico, M. (2009). Domain adaptation for statistical machine translation with monolingual resources. In
<italic>Proceedings of the fourth workshop on statistical machine translation</italic>
. Athens, Greece, pp. 182–189.</mixed-citation>
</ref>
<ref id="CR11">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bertoldi</surname>
<given-names>N</given-names>
</name>
<name>
<surname>Haddow</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Fouet</surname>
<given-names>JB</given-names>
</name>
</person-group>
<article-title>Improved minimum error rate training in Moses</article-title>
<source>The Prague Bulletin of Mathematical Linguistics</source>
<year>2009</year>
<volume>91</volume>
<fpage>7</fpage>
<lpage>16</lpage>
<pub-id pub-id-type="doi">10.2478/v10108-009-0011-9</pub-id>
</element-citation>
</ref>
<ref id="CR12">
<mixed-citation publication-type="other">Bisazza, A., Ruiz, N., & Federico, M. (2011). Fill-up versus interpolation methods for phrase-based SMT adaptation. In
<italic>Proceedings of the international workshop on spoken language translation</italic>
. San Francisco, California, USA, pp. 136–143.</mixed-citation>
</ref>
<ref id="CR13">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brin</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Page</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>The anatomy of a large-scale hypertextual Web search engine</article-title>
<source>Computer Networks and ISDN Systems</source>
<year>1998</year>
<volume>30</volume>
<fpage>107</fpage>
<lpage>117</lpage>
<pub-id pub-id-type="doi">10.1016/S0169-7552(98)00110-X</pub-id>
</element-citation>
</ref>
<ref id="CR14">
<mixed-citation publication-type="other">Carpuat, M., & Wu, D. (2007). Improving statistical machine translation using word sense disambiguation. In
<italic>Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning</italic>
. Prague, Czech Republic, pp. 61–72.</mixed-citation>
</ref>
<ref id="CR15">
<mixed-citation publication-type="other">Carpuat, M., Daumé III, H., Fraser, A., Quirk, C., Braune, F., Clifton, A., et al. (2012). Domain adaptation in machine translation: Final report. In
<italic>2012 Johns Hopkins summer workshop final report</italic>
. Baltimore, MD: Johns Hopkins University.</mixed-citation>
</ref>
<ref id="CR16">
<mixed-citation publication-type="other">Chen, J., Chau, R., & Yeh, C.H. (2004). Discovering parallel text from the World Wide Web. In
<italic>Proceedings of the 2nd workshop on Australasian information security, data mining and web intelligence, and software internationalisation</italic>
. Darlinghurst, Australia, Vol. 32, pp. 157–161.</mixed-citation>
</ref>
<ref id="CR17">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cho</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Garcia-Molina</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Page</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Efficient crawling through URL ordering</article-title>
<source>Computer Networks and ISDN Systems</source>
<year>1998</year>
<volume>30</volume>
<fpage>161</fpage>
<lpage>172</lpage>
<pub-id pub-id-type="doi">10.1016/S0169-7552(98)00108-1</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<mixed-citation publication-type="other">Daumé III, & H., Jagarlamudi, J. (2011). Domain adaptation for machine translation by mining unseen words. In
<italic>Proceedings of the 49th annual meeting of the association for computational linguistics and human language technologies, short papers</italic>
. Portland, Oregon, USA, pp. 407–412.</mixed-citation>
</ref>
<ref id="CR19">
<mixed-citation publication-type="other">Désilets, A., Farley, B., Stojanovic, M., & Patenaude, G. (2008). WeBiText: Building large heterogeneous translation memories from parallel web content. In
<italic>Proceedings of translating and the computer</italic>
. London, UK, Vol. 30, pp. 27–28.</mixed-citation>
</ref>
<ref id="CR20">
<mixed-citation publication-type="other">Dorado, I. G. (2008). Focused crawling: Algorithm survey and new approaches with a manual analysis. Master’s thesis, Department of Electro and Information Technology. Sweden: Lund University.</mixed-citation>
</ref>
<ref id="CR21">
<mixed-citation publication-type="other">Dziwiński, P., & Rutkowska, D. (2008). Ant focused crawling algorithm. In
<italic>Proceedings of the 9th international conference on artificial intelligence and soft computing</italic>
. Zakopane, Poland: Springer, pp. 1018–1028.</mixed-citation>
</ref>
<ref id="CR22">
<mixed-citation publication-type="other">Eck, M., Vogel, S., & Waibel, A. (2004). Language model adaptation for statistical machine translation based on information retrieval. In
<italic>Proceedings of the international conference on language resources and evaluation</italic>
. Lisbon, Portugal, pp. 327–330.</mixed-citation>
</ref>
<ref id="CR23">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Esplà-Gomis</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Forcada</surname>
<given-names>ML</given-names>
</name>
</person-group>
<article-title>Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor</article-title>
<source>The Prague Bulletin of Mathematical Linguistics</source>
<year>2010</year>
<volume>93</volume>
<fpage>77</fpage>
<lpage>86</lpage>
</element-citation>
</ref>
<ref id="CR24">
<mixed-citation publication-type="other">Finch, A., & Sumita, E. (2008). Dynamic model interpolation for statistical machine translation. In
<italic>Proceedings of the third workshop on statistical machine translation</italic>
. Columbus, Ohio, USA, pp. 208–215.</mixed-citation>
</ref>
<ref id="CR25">
<mixed-citation publication-type="other">Flournoy, R., & Duran, C. (2009). Machine translation and document localization at Adobe: From pilot to production. In
<italic>Proceedings of the twelfth machine translation summit</italic>
. Ottawa, Ontario, Canada, pp. 425–428.</mixed-citation>
</ref>
<ref id="CR27">
<mixed-citation publication-type="other">Foster, G., Goutte, C., & Kuhn, R. (2010). Discriminative instance weighting for domain adaptation in statistical machine translation. In
<italic>Proceedings of the 2010 conference on empirical methods in natural language processing</italic>
. Cambridge, Massachusetts, USA, pp. 451–459.</mixed-citation>
</ref>
<ref id="CR26">
<mixed-citation publication-type="other">Foster, G., & Kuhn, R. (2007). Mixture-model adaptation for SMT. In
<italic>Proceedings of the second workshop on statistical machine translation</italic>
. Prague, Czech Republic, pp. 128–135.</mixed-citation>
</ref>
<ref id="CR28">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gao</surname>
<given-names>Z</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Yi</surname>
<given-names>L</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Y</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>Q</given-names>
</name>
</person-group>
<article-title>Focused web crawling based on incremental learning</article-title>
<source>Journal of Computational Information Systems</source>
<year>2010</year>
<volume>6</volume>
<fpage>9</fpage>
<lpage>16</lpage>
</element-citation>
</ref>
<ref id="CR29">
<mixed-citation publication-type="other">Haddow, B. (2013). Applying pairwise ranked optimisation to improve the interpolation of translation models. In
<italic>Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies</italic>
. Atlanta, Georgia, pp. 342–347.</mixed-citation>
</ref>
<ref id="CR30">
<mixed-citation publication-type="other">He, Y., Ma, Y., Roturier, J., Way, A., & van Genabith, J. (2010). Improving the post-editing experience using translation recommendation: A user study. In
<italic>Proceedings of the ninth conference of the association for machine translation in the Americas</italic>
. Denver, Colorado, USA, pp. 247–256.</mixed-citation>
</ref>
<ref id="CR31">
<mixed-citation publication-type="other">Hildebrand, A.S., Eck, M., Vogel, S., & Waibel, A. (2005). Adaptation of the translation model for statistical machine translation based on information retrieval. In
<italic>Proceedings of the 10th annual conference of the European association for machine translation</italic>
. Budapest, Hungary, pp. 133–142.</mixed-citation>
</ref>
<ref id="CR32">
<mixed-citation publication-type="other">Johnson, H., Martin, J.D., Foster, G.F., & Kuhn, R. (2007). Improving translation quality by discarding most of the phrasetable. In
<italic>Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning</italic>
. Prague, Czech Republic, pp. 967–975.</mixed-citation>
</ref>
<ref id="CR33">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kilgarriff</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Grefenstette</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>Introduction to the special issue on the Web as corpus</article-title>
<source>Computational Linguistics</source>
<year>2003</year>
<volume>29</volume>
<issue>3</issue>
<fpage>333</fpage>
<lpage>348</lpage>
<pub-id pub-id-type="doi">10.1162/089120103322711569</pub-id>
</element-citation>
</ref>
<ref id="CR34">
<mixed-citation publication-type="other">Kneser, R., & Ney, H. (1995). Improved backing-off for N-gram language modeling. In
<italic>Proceedings of the international conference on acoustics, speech and signal processing</italic>
. pp. 181–184.</mixed-citation>
</ref>
<ref id="CR35">
<mixed-citation publication-type="other">Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In
<italic>Proceedings of the 2004 conference on empirical methods in natural language processing</italic>
. Barcelona, Spain, pp. 388–395.</mixed-citation>
</ref>
<ref id="CR36">
<mixed-citation publication-type="other">Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In
<italic>Conference proceedings of the tenth machine translation summit</italic>
. Phuket, Thailand, pp. 79–86.</mixed-citation>
</ref>
<ref id="CR37">
<mixed-citation publication-type="other">Koehn, P., & Haddow, B. (2012). Interpolated backoff for factored translation models. In
<italic>Proceedings of the tenth biennial conference of the association for machine translation in the Americas</italic>
. San Diego, CA, USA.</mixed-citation>
</ref>
<ref id="CR38">
<mixed-citation publication-type="other">Koehn, P., & Schroeder, J. (2007). Experiments in domain adaptationfor statistical machine translation. In
<italic>Proceedings of the second workshop on statistical machine translation</italic>
. Prague, Czech Republic, pp. 224–227.</mixed-citation>
</ref>
<ref id="CR39">
<mixed-citation publication-type="other">Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation. In
<italic>Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions</italic>
. Prague, Czech Republic, pp. 177–180.</mixed-citation>
</ref>
<ref id="CR40">
<mixed-citation publication-type="other">Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. In
<italic>Proceedings of the 3rd ACM international conference on web search and data mining</italic>
. New York, New York, USA, pp. 441–450.</mixed-citation>
</ref>
<ref id="CR41">
<mixed-citation publication-type="other">Langlais, P. (2002). Improving a general-purpose statistical translation engine by terminological lexicons. In
<italic>COMPUTERM 2002: Second International Workshop on Computational Terminology</italic>
. Taipei, Taiwan, pp. 1–7.</mixed-citation>
</ref>
<ref id="CR42">
<mixed-citation publication-type="other">Mansour, S., Wuebker, J., & Ney, H. (2011). Combining translation and language model scoring for domain-specific data filtering. In
<italic>International workshop on spoken language translation</italic>
. San Francisco, California, USA, pp. 222–229.</mixed-citation>
</ref>
<ref id="CR43">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Menczer</surname>
<given-names>F</given-names>
</name>
</person-group>
<article-title>Mapping the semantics of Web text and links</article-title>
<source>IEEE Internet Computing</source>
<year>2005</year>
<volume>9</volume>
<fpage>27</fpage>
<lpage>36</lpage>
<pub-id pub-id-type="doi">10.1109/MIC.2005.59</pub-id>
</element-citation>
</ref>
<ref id="CR44">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Menczer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Belew</surname>
<given-names>RK</given-names>
</name>
</person-group>
<article-title>Adaptive retrieval agents: Internalizing local context and scaling up to the web</article-title>
<source>Machine Learning</source>
<year>2000</year>
<volume>39</volume>
<fpage>203</fpage>
<lpage>242</lpage>
<pub-id pub-id-type="doi">10.1023/A:1007653114902</pub-id>
</element-citation>
</ref>
<ref id="CR45">
<mixed-citation publication-type="other">Moore, R.C., & Lewis, W. (2010). Intelligent selection of language model training data. In
<italic>Proceedings of the ACL 2010 conference short papers</italic>
. Uppsala, Sweden, pp. 220–224.</mixed-citation>
</ref>
<ref id="CR46">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Munteanu</surname>
<given-names>DS</given-names>
</name>
<name>
<surname>Marcu</surname>
<given-names>D</given-names>
</name>
</person-group>
<article-title>Improving machine translation performance by exploiting non-parallel corpora</article-title>
<source>Computational Linguistics</source>
<year>2005</year>
<volume>31</volume>
<fpage>477</fpage>
<lpage>504</lpage>
<pub-id pub-id-type="doi">10.1162/089120105775299168</pub-id>
</element-citation>
</ref>
<ref id="CR47">
<mixed-citation publication-type="other">Nakov, P. (2008). Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing. In
<italic>Proceedings of the third workshop on statistical machine translation</italic>
. Columbus, Ohio, USA, pp. 147–150.</mixed-citation>
</ref>
<ref id="CR48">
<mixed-citation publication-type="other">Nie, J.Y., Simard, M., Isabelle, P., & Durand, R. (1999). Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In
<italic>Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, ACM</italic>
. New York, New York, USA, pp. 74–81.</mixed-citation>
</ref>
<ref id="CR49">
<mixed-citation publication-type="other">Och, F.J. (2003). Minimum error rate training in statistical machine translation. In
<italic>Proceedings of the 41st annual meeting of the association for computational linguistics</italic>
. Sapporo, Japan, pp. 160–167.</mixed-citation>
</ref>
<ref id="CR72">
<mixed-citation publication-type="other">Papavassiliou, V., Prokopidis, P., & Thurmair, G. (2013). A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In
<italic>Proceedings of the sixth workshop on building and using comparable corpora</italic>
. Sofia, Bulgaria, pp. 43–51.</mixed-citation>
</ref>
<ref id="CR50">
<mixed-citation publication-type="other">Papineni, K., Roukos, S., Ward, T., & Zhu, W.J. (2002). BLEU: A method for automatic evaluation of machine translation. In
<italic>Proceedings of the 40th annual meeting of the association for computational linguistics</italic>
. Philadelphia, Pennsylvania, USA, pp. 311–318.</mixed-citation>
</ref>
<ref id="CR51">
<mixed-citation publication-type="other">Pecina, P., Toral, A., Way, A., Papavassiliou, V., Prokopidis, P., & Giagkou, M. (2011). Towards using web-crawled data for domain adaptation in statistical machine translation. In
<italic>Proceedings of the 15th annual conference of the European association for machine translation</italic>
. Leuven, Belgium, pp. 297–304.</mixed-citation>
</ref>
<ref id="CR53">
<mixed-citation publication-type="other">Pecina, P., Toral, A., Papavassiliou, V., Prokopidis, P., & van Genabith, J. (2012a). Domain adaptation of statistical machine translation using Web-crawled resources: a case study. In M. Cettolo, M. Federico, L. Specia & A. Way (Eds.),
<italic>Proceedings of the 16th annual conference of the European association for machine translation</italic>
. Trento, Italy, pp. 145–152.</mixed-citation>
</ref>
<ref id="CR52">
<mixed-citation publication-type="other">Pecina, P., Toral, A., & van Genabith, J. (2012b). Simple and effective parameter tuning for domain adaptation of statistical machine translation. In
<italic>Proceedings of the 24th international conference on computational linguistics</italic>
. Mumbai, India, pp. 2209–2224.</mixed-citation>
</ref>
<ref id="CR54">
<mixed-citation publication-type="other">Penkale, S., Haque, R., Dandapat, S., Banerjee, P., Srivastava, A.K., Du, J., et al. (2010). MaTrEx: The DCU MT system for WMT 2010. In
<italic>Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR</italic>
. Uppsala, Sweden, pp. 143–148.</mixed-citation>
</ref>
<ref id="CR55">
<mixed-citation publication-type="other">Poch, M., Toral, A., Hamon, O., Quochi, V., & Bel, N. (2012). Towards a user-friendly platform for building language resources based on web services. In N. Calzolari, K. Choukri, T. Declerck, M.U. Dogan, B. Maegaard, J. Mariani, J. Odijk & S. Piperidis (Eds.),
<italic>LREC, European Language Resources Association (ELRA)</italic>
. pp. 1156–1163.</mixed-citation>
</ref>
<ref id="CR56">
<mixed-citation publication-type="other">Qi, X., & Davison, B. D. (2009). Web page classification: Features and algorithms.
<italic>ACM Computing Surveys 41</italic>
, 12:1–12:31.</mixed-citation>
</ref>
<ref id="CR57">
<mixed-citation publication-type="other">Qin, J., & Chen, H. (2005). Using genetic algorithm in building domain-specific collections: An experiment in the nanotechnology domain. In
<italic>Proceedings of the 38th annual Hawaii international conference on system sciences</italic>
(Vol. 4). Big Island, Hawaii, USA: IEEE Computer Society.</mixed-citation>
</ref>
<ref id="CR58">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Resnik</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>NA</given-names>
</name>
</person-group>
<article-title>The Web as a parallel corpus</article-title>
<source>Computational Linguistics, Special Issue on the Web as Corpus</source>
<year>2003</year>
<volume>29</volume>
<fpage>349</fpage>
<lpage>380</lpage>
</element-citation>
</ref>
<ref id="CR59">
<mixed-citation publication-type="other">Sanchis-Trilles, G., & Casacuberta, F. (2010). Log-linear weight optimisation via bayesian adaptation in statistical machine translation. In
<italic>The 23rd international conference on computational linguistics, posters volume</italic>
. Beijing, China, pp. 1077–1085.</mixed-citation>
</ref>
<ref id="CR60">
<mixed-citation publication-type="other">Sennrich, R. (2012). Perplexity minimization for translation model domain adaptation in statistical machine translation. In
<italic>Proceedings of the 13th conference of the European chapter of the association for computational linguistics</italic>
. Avignon, France, pp. 539–549.</mixed-citation>
</ref>
<ref id="CR61">
<mixed-citation publication-type="other">Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In
<italic>Proceedings of the 7th biennial conference of the association for machine translation in the Americas</italic>
. Cambridge, MA, USA, pp. 223–231.</mixed-citation>
</ref>
<ref id="CR62">
<mixed-citation publication-type="other">Spousta, M., Marek, M., & Pecina, P. (2008). Victor: The Web-page cleaning tool. In
<italic>Proceedings of the 4th web as corpus workshop: Can we beat Google?</italic>
Marrakech, Morocco, pp. 12–17.</mixed-citation>
</ref>
<ref id="CR63">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Srinivasan</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Menczer</surname>
<given-names>F</given-names>
</name>
<name>
<surname>Pant</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>A general evaluation framework for topical crawlers</article-title>
<source>Information Retrieval</source>
<year>2005</year>
<volume>8</volume>
<fpage>417</fpage>
<lpage>447</lpage>
<pub-id pub-id-type="doi">10.1007/s10791-005-6993-5</pub-id>
</element-citation>
</ref>
<ref id="CR64">
<mixed-citation publication-type="other">Stolcke, A. (2002). SRILM-an extensible language modeling toolkit. In
<italic>Proceedings of international conference on spoken language processing</italic>
. Denver, Colorado, USA, pp. 257–286.</mixed-citation>
</ref>
<ref id="CR65">
<mixed-citation publication-type="other">Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., & Sawaf, H. (1997). Accelerated dp based search for statistical translation. In
<italic>Proceedings of the fifth European conference on speech communication and technology</italic>
. Rhodes, Greece, pp. 2667–2670.</mixed-citation>
</ref>
<ref id="CR66">
<mixed-citation publication-type="other">Toral, A. (2013). Hybrid selection of language model training data using linguistic information and perplexity. In
<italic>Proceedings of the second workshop on hybrid approaches to translation</italic>
. Sofia, Bulgaria, pp. 8–12.</mixed-citation>
</ref>
<ref id="CR67">
<mixed-citation publication-type="other">Varga, D., Németh, L., Halácsy, P., Kornai, A., Trón, V., & Nagy, V. (2005). Parallel corpora for medium density languages. In
<italic>Proceedings of the recent advances in natural language processing</italic>
. Borovets, Bulgaria, pp. 590–596.</mixed-citation>
</ref>
<ref id="CR68">
<mixed-citation publication-type="other">Wu, H., & Wang, H. (2004). Improving domain-specific word alignment with a general bilingual corpus. In
<italic>Proceedings of the 6th conference of the association for machine translation in the Americas</italic>
. Washington, DC, USA, pp. 262–271.</mixed-citation>
</ref>
<ref id="CR69">
<mixed-citation publication-type="other">Wu, H., Wang, H., & Zong, C. (2008). Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In
<italic>Proceedings of the 22nd international conference on computational linguistics</italic>
. Manchester, United Kingdom, Vol. 1, pp. 993–1000.</mixed-citation>
</ref>
<ref id="CR70">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>H</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>J</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>KCC</given-names>
</name>
</person-group>
<article-title>PEBL: Web page classification without negative examples</article-title>
<source>IEEE Transactions on Knowledge and Data Engineering</source>
<year>2004</year>
<volume>16</volume>
<issue>1</issue>
<fpage>70</fpage>
<lpage>81</lpage>
<pub-id pub-id-type="doi">10.1109/TKDE.2004.1264816</pub-id>
</element-citation>
</ref>
<ref id="CR71">
<mixed-citation publication-type="other">Zhang, Y., Wu, K., Gao, J., & Vines, P. (2006). Automatic acquisition of Chinese-English parallel corpus from the Web. In
<italic>Proceedings of the 28th European conference on information retrieval</italic>
. London, UK, pp. 420–431.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Sarre/explor/MusicSarreV3/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0000460 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 0000460 | SxmlIndent | more
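
The exported record can also be piped into standard XML tools. A minimal sketch, assuming xmllint (from libxml2) is installed and that HfdSelect emits the record XML shown above; the XPath expression is illustrative:

EXPLOR_STEP=$WICRI_ROOT/Wicri/Sarre/explor/MusicSarreV3/Data/Pmc/Corpus
# Export record 0000460 and print its English title
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 0000460 | xmllint --xpath 'string(//titleStmt/title[@xml:lang="en"])' -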

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Wicri/Sarre
   |area=    MusicSarreV3
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Sun Jul 15 18:16:09 2018. Site generation: Tue Mar 5 19:21:25 2024