Tamazight exploration server



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">A massively parallel corpus: the Bible in 100 languages</title>
<author>
<name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana, IL 61801 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
<affiliation>
<nlm:aff id="Aff2">School of Informatics, University of Edinburgh, Edinburgh, UK</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">26321896</idno>
<idno type="pmc">4551210</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551210</idno>
<idno type="RBID">PMC:4551210</idno>
<idno type="doi">10.1007/s10579-014-9287-y</idno>
<date when="2014">2014</date>
<idno type="wicri:Area/Pmc/Corpus">000067</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000067</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">A massively parallel corpus: the Bible in 100 languages</title>
<author>
<name sortKey="Christodouloupoulos, Christos" sort="Christodouloupoulos, Christos" uniqKey="Christodouloupoulos C" first="Christos" last="Christodouloupoulos">Christos Christodouloupoulos</name>
<affiliation>
<nlm:aff id="Aff1">Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana, IL 61801 USA</nlm:aff>
</affiliation>
</author>
<author>
<name sortKey="Steedman, Mark" sort="Steedman, Mark" uniqKey="Steedman M" first="Mark" last="Steedman">Mark Steedman</name>
<affiliation>
<nlm:aff id="Aff2">School of Informatics, University of Edinburgh, Edinburgh, UK</nlm:aff>
</affiliation>
</author>
</analytic>
<series>
<title level="j">Language Resources and Evaluation</title>
<idno type="ISSN">1574-020X</idno>
<idno type="eISSN">1574-0218</idno>
<imprint>
<date when="2014">2014</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kanungo, T" uniqKey="Kanungo T">T Kanungo</name>
</author>
<author>
<name sortKey="Resnik, P" uniqKey="Resnik P">P Resnik</name>
</author>
<author>
<name sortKey="Mao, S" uniqKey="Mao S">S Mao</name>
</author>
<author>
<name sortKey="Kim, D" uniqKey="Kim D">D Kim</name>
</author>
<author>
<name sortKey="Zheng, Q" uniqKey="Zheng Q">Q Zheng</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Koehn, P" uniqKey="Koehn P">P Koehn</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Marcus, M" uniqKey="Marcus M">M Marcus</name>
</author>
<author>
<name sortKey="Santorini, B" uniqKey="Santorini B">B Santorini</name>
</author>
<author>
<name sortKey="Marcinkiewicz, Ma" uniqKey="Marcinkiewicz M">MA Marcinkiewicz</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Och, Fj" uniqKey="Och F">FJ Och</name>
</author>
<author>
<name sortKey="Ney, H" uniqKey="Ney H">H Ney</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Potthast, M" uniqKey="Potthast M">M Potthast</name>
</author>
<author>
<name sortKey="Barr N Cede O, A" uniqKey="Barr N Cede O A">A Barrón-Cedeño</name>
</author>
<author>
<name sortKey="Stein, B" uniqKey="Stein B">B Stein</name>
</author>
<author>
<name sortKey="Rosso, P" uniqKey="Rosso P">P Rosso</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Resnik, P" uniqKey="Resnik P">P Resnik</name>
</author>
<author>
<name sortKey="Olsen, M" uniqKey="Olsen M">M Olsen</name>
</author>
<author>
<name sortKey="Diab, M" uniqKey="Diab M">M Diab</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wei, Cp" uniqKey="Wei C">CP Wei</name>
</author>
<author>
<name sortKey="Yang, Cc" uniqKey="Yang C">CC Yang</name>
</author>
<author>
<name sortKey="Lin, Cm" uniqKey="Lin C">CM Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Lang Resour Eval</journal-id>
<journal-id journal-id-type="iso-abbrev">Lang Resour Eval</journal-id>
<journal-title-group>
<journal-title>Language Resources and Evaluation</journal-title>
</journal-title-group>
<issn pub-type="ppub">1574-020X</issn>
<issn pub-type="epub">1574-0218</issn>
<publisher>
<publisher-name>Springer Netherlands</publisher-name>
<publisher-loc>Dordrecht</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">26321896</article-id>
<article-id pub-id-type="pmc">4551210</article-id>
<article-id pub-id-type="publisher-id">9287</article-id>
<article-id pub-id-type="doi">10.1007/s10579-014-9287-y</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Original Paper</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A massively parallel corpus: the Bible in 100 languages</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Christodouloupoulos</surname>
<given-names>Christos</given-names>
</name>
<address>
<email>christod@illinois.edu</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Steedman</surname>
<given-names>Mark</given-names>
</name>
<address>
<email>steedman@inf.ed.ac.uk</email>
</address>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<aff id="Aff1">
<label></label>
Department of Computer Science, UIUC, 201 N. Goodwin Ave, Urbana, IL 61801 USA</aff>
<aff id="Aff2">
<label></label>
School of Informatics, University of Edinburgh, Edinburgh, UK</aff>
</contrib-group>
<pub-date pub-type="epub">
<day>19</day>
<month>11</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="pmc-release">
<day>19</day>
<month>11</month>
<year>2014</year>
</pub-date>
<pub-date pub-type="ppub">
<year>2015</year>
</pub-date>
<volume>49</volume>
<issue>2</issue>
<fpage>375</fpage>
<lpage>395</lpage>
<permissions>
<copyright-statement>© The Author(s) 2014</copyright-statement>
<license license-type="OpenAccess">
<license-p>
<bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.</license-p>
</license>
</permissions>
<abstract id="Abs1">
<p>We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.</p>
</abstract>
<kwd-group xml:lang="en">
<title>Keywords</title>
<kwd>Parallel corpus</kwd>
<kwd>Multilingual corpus</kwd>
<kwd>Comparative corpus linguistics</kwd>
</kwd-group>
<custom-meta-group>
<custom-meta>
<meta-name>issue-copyright-statement</meta-name>
<meta-value>© Springer Science+Business Media Dordrecht 2015</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body>
<sec id="Sec1" sec-type="intro">
<title>Introduction</title>
<p>Parallel corpora are a valuable resource for linguistic research and natural language processing (NLP) applications. One of the main uses of the latter kind is as training material for statistical machine translation (SMT), where large amounts of aligned data are standardly used to learn word alignment models between the lexica of two languages (for example, in the Giza++ system of Och and Ney
<xref ref-type="bibr" rid="CR26">2003</xref>
). Another interesting use of parallel corpora in NLP is
<italic>projected learning</italic>
of linguistic structure. In this approach, supervised data from a resource-rich language is used to guide the unsupervised learning algorithm in a target language. Although there are some techniques that do not require parallel texts (e.g. Cohen et al.
<xref ref-type="bibr" rid="CR8">2011</xref>
), the most successful models use sentence-aligned corpora (Yarowsky and Ngai
<xref ref-type="bibr" rid="CR40">2001</xref>
; Das and Petrov
<xref ref-type="bibr" rid="CR9">2011</xref>
).</p>
<p>Most parallel corpora exist in a small number of languages or in common language pairs (e.g. the English-French
<italic>Hansards</italic>
corpus by Germann
<xref ref-type="bibr" rid="CR15">2001</xref>
). There are, however, a few corpora that contain multiple languages: the Europarl corpus (Koehn
<xref ref-type="bibr" rid="CR18">2005</xref>
) contains parallel translations of European Parliament proceedings in 21 languages; the Joint Research Centre of the European Commission has released multiple corpora in more than 20 languages, including the sentence-aligned JRC-Acquis (22 languages, Steinberger et al.
<xref ref-type="bibr" rid="CR31">2006</xref>
) and the paragraph-aligned DGT-Acquis (23 languages); the InterCorp corpus (Čermák and Rosen
<xref ref-type="bibr" rid="CR6">2012</xref>
), a collection of texts in Czech and 27 other European languages. To our knowledge, the most multilingual corpus currently available is the OPUS collection (Tiedemann
<xref ref-type="bibr" rid="CR33">2012</xref>
), which contains 90 languages in various parallel corpora. However, comparatively few of the possible language pairs are available with parallel text.
<xref ref-type="fn" rid="Fn1">1</xref>
</p>
<p>In an attempt to access parallel material from as many and as diverse languages as possible, a very widely translated text is needed. In this work we will be following Resnik et al. (
<xref ref-type="bibr" rid="CR29">1999</xref>
) in creating a massively parallel corpus based on Bible translations (cf. Abney and Bird
<xref ref-type="bibr" rid="CR1">2010</xref>
,
<xref ref-type="bibr" rid="CR2">2011</xref>
). According to United Bible Societies (
<xref ref-type="bibr" rid="CR35">2013</xref>
) there are at least 2,527 translations of parts of the Bible and 475 full translations. These numbers exceed by far the translations of any other work of literature—according to Wikipedia (
<xref ref-type="bibr" rid="CR39">2013</xref>
) the next most translated work of literature is ‘Pinocchio’ with 260 languages.</p>
<p>Resnik et al. (
<xref ref-type="bibr" rid="CR29">1999</xref>
) used 13 different translations of the Bible; we will increase the number of languages to 100. By having 100 different languages in the same corpus we can get 4,950 unique language pairs
<xref ref-type="fn" rid="Fn2">2</xref>
—although not all translations contain the entire Bible as we shall see later—making this by far the largest number of bitexts available: in comparison, DGT-Acquis contains 253 pairs; InterCorp, 351; and the OPUS collection contains 3,800 pairs (Tiedemann
<xref ref-type="bibr" rid="CR33">2012</xref>
), but not all pairs contain the same amount of text.</p>
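<p>The pair counts quoted above follow from counting unordered pairs: n languages yield n(n-1)/2 possible bitexts. The short Python sketch below reproduces that arithmetic for the 100 languages of the present corpus and the 23 languages of DGT-Acquis. The helper name language_pairs is hypothetical and is introduced only for illustration.</p>
<preformat><![CDATA[
```python
from math import comb


def language_pairs(n_languages: int) -> int:
    """Number of unordered language pairs (bitexts) among n languages."""
    return comb(n_languages, 2)


print(language_pairs(100))  # 100 languages in the Bible corpus -> 4950
print(language_pairs(23))   # 23 languages in DGT-Acquis -> 253
```
]]></preformat>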
</sec>
<sec id="Sec2">
<title>The Bible as a corpus</title>
<sec id="Sec3">
<title>Current and potential uses of the corpus</title>
<p>As we mentioned in the previous section, most parallel corpora are created for SMT training purposes. While the relatively small size of the present corpus makes it rather unsuitable for the creation of full-scale SMT systems across the 4,950 language pairs, we believe that it can be used to tune the probability distributions of an existing SMT system for a phylogenetically similar language. Alternatively it can be used as a source of bi/multi-lingual dictionaries in emergency situations where human translators or other linguistic resources are not available [e.g. the earthquakes in Haiti (Lewis
<xref ref-type="bibr" rid="CR21">2010</xref>
) or Japan (Neubig et al.
<xref ref-type="bibr" rid="CR24">2011</xref>
)].</p>
<p>Steinberger et al. (
<xref ref-type="bibr" rid="CR32">2013</xref>
) list a number of potential uses of parallel corpora in NLP. These include: annotation projection for co-reference resolution, discourse analysis; checking translation consistency automatically; testing and benchmarking alignment software (for sentences, words, etc.); producing multilingual lexical and semantic resources such as dictionaries and ontologies; annotation projection across languages for Named Entity Recognition (NER, Ehrmann et al.
<xref ref-type="bibr" rid="CR12">2011</xref>
), sentiment analysis (Steinberger et al.
<xref ref-type="bibr" rid="CR30">2011</xref>
), multi-document summarization (Turchi et al.
<xref ref-type="bibr" rid="CR34">2010</xref>
); cross-lingual plagiarism detection (Potthast et al.
<xref ref-type="bibr" rid="CR27">2011</xref>
); multilingual and cross-lingual document classification (Wei et al.
<xref ref-type="bibr" rid="CR37">2008</xref>
); creation of multilingual semantic space in Latent Semantic Analysis (LSA, Landauer and Littman
<xref ref-type="bibr" rid="CR19">1991</xref>
) and Kernel Canonical Correlation Analysis (KCCA, Vinokourov et al.
<xref ref-type="bibr" rid="CR36">2002</xref>
). We believe that, despite some disadvantages (e.g. the lack of modern named entities and other issues discussed in Sect.
<xref rid="Sec5" ref-type="sec">2.3</xref>
), the Bible is an excellent resource for NLP, especially for low-resource languages.</p>
<p>Multilingual corpora are also ideal for typological or comparative language analysis, especially when a large number of languages can be collected. Indeed the present corpus has already been used for cross-linguistic induction and comparison of syntactic categories (Christodoulopoulos
<xref ref-type="bibr" rid="CR7">2013</xref>
, pp. 143–159). Similarly, we believe that parallel corpora can be invaluable to the whole area of Digital Humanities (e.g. Dipper and Schultz-Balluff
<xref ref-type="bibr" rid="CR10">2013</xref>
).</p>
</sec>
<sec id="Sec4">
<title>Advantages</title>
<p>There are a number of advantages to using the Bible as a corpus. Not only has it been translated into numerous languages; it has also been translated into a much more diverse set of languages than any other book. This is mostly due to the efforts of missionary linguists such as the Summer Institute of Linguistics (SIL, Brend and Pike
<xref ref-type="bibr" rid="CR4">1977</xref>
) that combine anthropological and linguistic research with missionary expeditions in remote locations and, as a result, produce Bible translations.</p>
<p>Another advantage of the Bible is the size of the text. The complete canonical 66 books contain around 800k words in English. This might seem small compared to modern (parallel) corpora—like, for instance, the Canadian Hansards corpus (Germann
<xref ref-type="bibr" rid="CR15">2001</xref>
) with ~19M words, and the Europarl corpus (~60M words on average per language); however, it is much bigger than any single work of literature: for instance, the size of the average novel is about 100k words, while ‘Pinocchio’ is ~45k.</p>
<p>The Bible is also unusual as a text in that every verse is uniquely identified by a book, chapter and verse number. This allows for an automatic, unambiguous alignment at the verse level across every language (with minor exceptions that will be discussed in Sect.
<xref rid="Sec8" ref-type="sec">3</xref>
).</p>
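<p>As a rough illustration of verse-level alignment, the Python sketch below aligns translations on shared (book, chapter, verse) keys. The key layout, the function name align_verses and the toy dictionaries are assumptions made for this example and do not describe the scripts actually used to build the corpus. Verses missing from any translation are simply skipped.</p>
<preformat><![CDATA[
```python
def align_verses(*translations):
    """Align translations on shared verse keys.

    Each translation is a dict mapping a (book, chapter, verse) key to text.
    Only verses present in every translation are returned, in sorted key order.
    """
    shared = set(translations[0])
    for translation in translations[1:]:
        shared &= set(translation)
    return [(key, tuple(t[key] for t in translations)) for key in sorted(shared)]


# Toy data (hypothetical key format); the Latin dict lacks Gen 1:1 on purpose.
english = {
    ("GEN", 1, 1): "In the beginning God created the heaven and the earth.",
    ("GEN", 1, 3): "And God said, Let there be light: and there was light.",
}
latin = {
    ("GEN", 1, 3): "dixitque Deus fiat lux et facta est lux",
}

for key, texts in align_verses(english, latin):
    print(key, texts)
```
]]></preformat>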
<p>A final advantage is that the Bible translations collected here are either public domain, or—as in the case of the King James Version—free to use for research purposes.
<xref ref-type="fn" rid="Fn3">3</xref>
</p>
</sec>
<sec id="Sec5">
<title>Potential issues</title>
<sec id="Sec6">
<title>Translation methods</title>
<p>As with every translation work, one important question concerns the style and fidelity of translation. There are two competing translation methods:
<italic>word-for-word</italic>
(or formal equivalence), in which the literal meaning of each word as well as the syntactic structure is preserved where possible; and
<italic>sense-for-sense</italic>
translation (or dynamic equivalence), in which the ‘spirit’ or emotional effect of the text is kept. The former method is more appropriate for the type of analysis required here and has been put forward as the preferred method by the Catholic Church (
<xref ref-type="bibr" rid="CR5">2001</xref>
), among others. However, some of the translation guides used by the missionary linguists follow the latter method. For instance Nida and Taber (
<xref ref-type="bibr" rid="CR25">1969</xref>
) provide a theoretical framework as well as a set of principles for Bible translations, in which they advise:
<list list-type="bullet">
<list-item>
<p>Content is to have priority over style.</p>
</list-item>
<list-item>
<p>Contextual consistency is to have priority over verbal consistency.</p>
</list-item>
<list-item>
<p>Long, involved sentences are to be broken up on the basis of receptor-language usage.</p>
</list-item>
<list-item>
<p>Nouns expressing events should be changed to verbs whenever the results would be more in keeping with receptor-language usage. (Nida and Taber
<xref ref-type="bibr" rid="CR25">1969</xref>
, p. 182)</p>
</list-item>
</list>
This does not imply that every Bible translator has followed these principles, but given that the goal of the missionary linguists was to convey the message of the Bible, it makes sense that they would choose a more content-sensitive approach to their translations.</p>
<p>Finally, we should keep in mind that it is not always desirable to have a formally equivalent translation
<xref ref-type="fn" rid="Fn4">4</xref>
: for instance in MT, when translating the title of Stieg Larsson’s third book, a translation system should return “The Girl Who Kicked the Hornet’s Nest” instead of the literal translation of the Swedish “Luftslottet som sprängdes”, which would be “The air castle that was exploded”. However, from a computational linguistics perspective, it is usually more helpful to have formally equivalent translations.</p>
</sec>
<sec id="Sec7">
<title>Other issues</title>
<p>A major issue that is relevant to the use of the Bible as a parallel corpus is the writing style; in particular, the use of antiquated language. This is especially problematic in languages (mostly Western European) where Bible translations were created hundreds of years in the past. Even if modern translations exist, often the editors would choose a more archaic style of writing to match the earlier text and to give the appropriate gravity to the material. Some exceptions exist, at least in English. As Resnik et al. (
<xref ref-type="bibr" rid="CR29">1999</xref>
) showed, the New International Version (NIV) covers a significant variety of present-day terms as found in the Longman Dictionary of Contemporary English (LDOCE, Proctor
<xref ref-type="bibr" rid="CR28">1978</xref>
) and in the Brown Corpus (Francis
<xref ref-type="bibr" rid="CR14">1964</xref>
).</p>
<p>For many translations, it is an open question whether the writing style of the Bible is representative of present-day language, but given the limited availability of written sources in some languages, and the breadth of available translations, the Bible corpus represents the best resource for cross-linguistic analysis. Indeed there have been a number of projects that used Bible translations as either a primary or secondary source of material (Resnik et al.
<xref ref-type="bibr" rid="CR29">1999</xref>
; Yarowsky and Ngai
<xref ref-type="bibr" rid="CR40">2001</xref>
; Wierzbicka
<xref ref-type="bibr" rid="CR38">2001</xref>
; Kanungo et al.
<xref ref-type="bibr" rid="CR17">2005</xref>
).</p>
<p>A final limiting factor is the fact that the alignment information is limited to verses (rather than sentences as is the case in the JRC-Acquis corpus for instance). While it is often the case that a verse corresponds to a whole sentence, there are verses that span more than two sentences, or are limited to sub-sentence phrases. The exact number varies depending on what is considered to be sentence-final punctuation. When counting only ‘.’ and ‘?’, out of the ~30,000 verses, only 4,000 contain multiple sentences. However, this number increases to 10,000 if we include ‘;’ and more than half the verses if we add ‘:’ as a sentence-final marker. To make things worse, as we can see in the following example, different translations use different punctuation schemas which means that they contain significantly different numbers of sentences.
<list list-type="order">
<list-item>
<p>      a.  [A ka ki te Atua], [Kia marama]: [na ka marama]                   (Maori)</p>
<p>      b. [Guð sagði]:  [“Verði ljós!”]  [Og það varð ljós].                   (Icelandic)</p>
<p>      c. [Dio disse]: [«Sia la luce!»]. [E la luce fu]                          (Italian)</p>
<p>      d. [dixitque Deus] [fiat lux] [et facta est lux]                                (Latin)</p>
<p>      e.  [And God said], [Let there be light]: [and there was light].      (English)</p>
</list-item>
</list>
</p>
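<p>The dependence of the sentence count on the chosen set of sentence-final marks can be sketched in a few lines of Python. This is a deliberately crude count under stated assumptions: every occurrence of a mark is treated as a sentence boundary, and a verse without any mark counts as one unit. The function name count_sentences is hypothetical, and this is not the procedure used to obtain the figures above.</p>
<preformat><![CDATA[
```python
def count_sentences(verse: str, final_marks: str = ".?") -> int:
    """Count sentence-final marks in a verse; a verse with none counts as one unit."""
    n_marks = sum(verse.count(mark) for mark in final_marks)
    return max(n_marks, 1)


verse = "And God said, Let there be light: and there was light."
for marks in (".?", ".?;", ".?;:"):
    # Adding ':' to the set of final marks splits this verse into two sentences.
    print(repr(marks), count_sentences(verse, marks))
```
]]></preformat>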
</sec>
</sec>
</sec>
<sec id="Sec8">
<title>Acquiring and converting source material</title>
<sec id="Sec9">
<title>Corpus collection</title>
<p>Despite the great number of translations, many Bible translations exist only in audio form. This is understandable, since some of the translated languages exist only in spoken form, and even if an alphabet exists, most speakers of that language may be illiterate. Furthermore, even when textual resources have been available for years, electronic copies are hard to obtain. This means that there is a limited availability of machine-readable Bibles online. In English, for instance, one of the most widespread Bibles, the King James Version, is not made available in electronic form by the official licensing body in Scotland (the Scottish Bible Board) even though the text is free to use for research purposes. Instead, we have had to rely on third-party sources, like the ones mentioned in the next paragraph. When multiple versions of the Bible were available, we opted for a single translation, usually the oldest available one (e.g. the King James Version for English), since the aim of this project was breadth rather than depth. We believe that this will lead to a more coherent corpus, as older translations tend to be more literal, but we acknowledge that this brings up the issue of diachronic language change. As discussed in Sect.
<xref rid="Sec1" ref-type="sec">1</xref>
, this problem is not as severe as initially perceived; however we are also open to the idea of adding multiple versions of the same language in the future.
<fig id="Fig1">
<label>Fig. 1</label>
<caption>
<p>Different online Bible versions of Gen:1–2 in Afrikaans (the last box in the figure is in Dutch)</p>
</caption>
<graphic xlink:href="10579_2014_9287_Fig1_HTML" id="MO1"></graphic>
</fig>
<fig id="Fig2">
<label>Fig. 2</label>
<caption>
<p>Level 1 CES annotation</p>
</caption>
<graphic xlink:href="10579_2014_9287_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
<p>There are a few websites that offer access to public domain, machine-readable versions of the Bible in multiple languages. The four main sources used here were the
<italic>Bible Database</italic>
, the
<italic>Unbound Bible</italic>
,
<italic>GospelGo</italic>
and the
<italic>Bible Gateway</italic>
websites. Each one offered the Bible in different formats, some containing HTML and others plain text. Figure 
<xref rid="Fig1" ref-type="fig">1</xref>
presents a comparison of the different versions.</p>
<p>In order to unify all the different styles of annotation under a well-defined universal format, we followed Resnik et al. (
<xref ref-type="bibr" rid="CR29">1999</xref>
) in using the Corpus Encoding Standard (CES, Ide
<xref ref-type="bibr" rid="CR16">1998</xref>
), conforming to the level 1 annotation guidelines. Practically, this means that each Bible was formatted as an XML file, containing nested
elements corresponding to books and chapters, and elements that corresponded to verses. Each of the verses was marked with a serial ID. Figure 
<xref rid="Fig2" ref-type="fig">2</xref>
shows the same two verses of Fig. 
<xref rid="Fig1" ref-type="fig">1</xref>
as formatted by custom scripts.</p>
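<p>To make the nested layout concrete, the Python sketch below parses a small CES-like fragment and indexes verses by their ID. The element names (div, seg), the attribute names (id, type) and the ID scheme in the sample are assumptions chosen for illustration; the actual markup is the one shown in Fig. 2.</p>
<preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical CES-like fragment: books and chapters as nested div elements,
# verses as seg elements carrying a serial ID.
SAMPLE = """
<body>
  <div id="b.GEN" type="book">
    <div id="b.GEN.1" type="chapter">
      <seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
      <seg id="b.GEN.1.3" type="verse">And God said, Let there be light: and there was light.</seg>
    </div>
  </div>
</body>
"""


def read_verses(xml_text: str) -> dict:
    """Return a mapping from verse ID to verse text."""
    root = ET.fromstring(xml_text.strip())
    return {
        seg.get("id"): (seg.text or "").strip()
        for seg in root.iter("seg")
        if seg.get("type") == "verse"
    }


verses = read_verses(SAMPLE)
print(verses["b.GEN.1.3"])
```
]]></preformat>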
</sec>
<sec id="Sec10">
<title>Conversion problems</title>
<p>The most common issue we encountered when converting and formatting our corpus was the inconsistency in the formatting of the online sources. Some of the more common problems included incorrect HTML (e.g. unclosed tags and inconsistent use of capitalisation); errors in verse numbering (e.g. “missing” verses were actually included in previous or subsequent verses marked by text instead of HTML tags); character rendering errors (e.g. ž in Croatian rendered as ?); and missing characters (e.g. the final character in each verse of the Thai and Latin translations). In most cases the errors were systematic and could be corrected semi-automatically; in other cases (like the missing characters) we had to find multiple sources of the same translation. If neither option was available, the errors were left in the final version of the corpus. Overall, the whole process took about two to three person-months.</p>
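<p>One of the systematic checks described above, spotting verses that appear to be missing from a chapter, could be sketched in Python as follows. This is an illustrative assumption about how such a check might look, not the custom scripts used for the corpus; the function name missing_verses is hypothetical. A reported gap would still need manual inspection, since the verse text may have been merged into a neighbouring verse.</p>
<preformat><![CDATA[
```python
def missing_verses(verse_numbers):
    """Return verse numbers absent from an otherwise consecutive chapter."""
    present = set(verse_numbers)
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]


# Verse 4 is absent here; its text may sit inside verse 3 or verse 5.
print(missing_verses([1, 2, 3, 5, 6]))
```
]]></preformat>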

<p>Finally, when dealing with machine-readable multilingual texts, character encoding can cause difficulties. This is especially true for languages that do not have a strong international presence and for which the need to adopt an encoding standard is low. However, we did not encounter such problems during the creation of this corpus; all languages included in the corpus have been encoded using the Universal Character Set (UCS, Allen et al.
<xref ref-type="bibr" rid="CR3">2012</xref>
), specifically the UCS Transformation Format-8-bit (UTF-8).</p>
</sec>
</sec>
<sec id="Sec11">
<title>Parallel corpus information</title>
<p>The full corpus contains 100 languages from across the world (see Table 
<xref rid="Tab1" ref-type="table">1</xref>
for the names of the languages). As Table 
<xref rid="Tab2" ref-type="table">2</xref>
shows, the majority are non-Indo-European languages and 39 of the languages are spoken by fewer than 1 million speakers.</p>
<p>Figure 
<xref rid="Fig3" ref-type="fig">3</xref>
presents a geographical distribution of the languages (data from Dryer and Haspelmath
<xref ref-type="bibr" rid="CR11">2013</xref>
) that cover almost all the continents, and
<xref rid="Sec14" ref-type="sec">Appendix A</xref>
contains detailed linguistic information about every language as well as the approximate date of translation (data from Lewis et al.
<xref ref-type="bibr" rid="CR20">2014</xref>
).
<table-wrap id="Tab1">
<label>Table 1</label>
<caption>
<p>Languages in the Bible Corpus</p>
</caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left">Achuar-Shiwiar</td>
<td align="left">Gaelic (Scottish)
<sup></sup>
</td>
<td align="left">
<bold>Polish</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Afrikaans</bold>
</td>
<td align="left">Galela</td>
<td align="left">
<bold>Portuguese</bold>
</td>
</tr>
<tr>
<td align="left">Aguaruna</td>
<td align="left">
<bold>German</bold>
</td>
<td align="left">Potawatomi
<sup></sup>
</td>
</tr>
<tr>
<td align="left">Akawaio</td>
<td align="left">
<bold>Greek</bold>
</td>
<td align="left">
<bold>Q’eqchi’</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Albanian</bold>
</td>
<td align="left">Gujarati</td>
<td align="left">Quichua</td>
</tr>
<tr>
<td align="left">Amharic</td>
<td align="left">
<bold>Haitian Creole</bold>
</td>
<td align="left">Romani</td>
</tr>
<tr>
<td align="left">Amuzgo</td>
<td align="left">
<bold>Hebrew</bold>
</td>
<td align="left">
<bold>Romanian</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Arabic</bold>
</td>
<td align="left">
<bold>Hindi</bold>
</td>
<td align="left">
<bold>Russian</bold>
</td>
</tr>
<tr>
<td align="left">Armenian
<sup></sup>
</td>
<td align="left">
<bold>Hungarian</bold>
</td>
<td align="left">
<bold>Serbian</bold>
</td>
</tr>
<tr>
<td align="left">Aukan</td>
<td align="left">
<bold>Icelandic</bold>
</td>
<td align="left">Shuar (Jivaro)</td>
</tr>
<tr>
<td align="left">Barasana-Eduria</td>
<td align="left">
<bold>Indonesian</bold>
</td>
<td align="left">
<bold>Slovak</bold>
</td>
</tr>
<tr>
<td align="left">Basque</td>
<td align="left">
<bold>Italian</bold>
</td>
<td align="left">
<bold>Slovene</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Bulgarian</bold>
</td>
<td align="left">Jakalteko</td>
<td align="left">
<bold>Somali</bold>
</td>
</tr>
<tr>
<td align="left">Cabécar</td>
<td align="left">
<bold>Japanese</bold>
</td>
<td align="left">
<bold>Spanish</bold>
</td>
</tr>
<tr>
<td align="left">Cakchiquel</td>
<td align="left">K’iche’</td>
<td align="left">Swahili</td>
</tr>
<tr>
<td align="left">Campa (Asháninka)</td>
<td align="left">Kabyle</td>
<td align="left">
<bold>Swedish</bold>
</td>
</tr>
<tr>
<td align="left">Camsá</td>
<td align="left">
<bold>Kannada</bold>
</td>
<td align="left">Syriac</td>
</tr>
<tr>
<td align="left">
<bold>Cebuano</bold>
</td>
<td align="left">
<bold>Korean</bold>
</td>
<td align="left">Tachelhit</td>
</tr>
<tr>
<td align="left">Chamorro
<sup></sup>
</td>
<td align="left">
<bold>Latin</bold>
</td>
<td align="left">
<bold>Tagalog</bold>
</td>
</tr>
<tr>
<td align="left">Cherokee</td>
<td align="left">Latvian</td>
<td align="left">Tamajaq (Tuareg)
<sup></sup>
</td>
</tr>
<tr>
<td align="left">Chinantec (Quiotepec)</td>
<td align="left">
<bold>Lithuanian</bold>
</td>
<td align="left">Telugu</td>
</tr>
<tr>
<td align="left">
<bold>Chinese</bold>
</td>
<td align="left">Lukpa</td>
<td align="left">
<bold>Thai</bold>
</td>
</tr>
<tr>
<td align="left">Coptic</td>
<td align="left">
<bold>Malagasy</bold>
</td>
<td align="left">
<bold>Turkish</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Croatian</bold>
</td>
<td align="left">
<bold>Malayalam</bold>
</td>
<td align="left">Ukrainian</td>
</tr>
<tr>
<td align="left">
<bold>Czech</bold>
</td>
<td align="left">Mam</td>
<td align="left">Uma</td>
</tr>
<tr>
<td align="left">
<bold>Danish</bold>
</td>
<td align="left">Manx
<sup></sup>
</td>
<td align="left">Uspanteco</td>
</tr>
<tr>
<td align="left">Dinka</td>
<td align="left">
<bold>Maori</bold>
</td>
<td align="left">
<bold>Vietnamese</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>English</bold>
</td>
<td align="left">
<bold>Marathi</bold>
</td>
<td align="left">Wolaytta</td>
</tr>
<tr>
<td align="left">
<bold>Esperanto</bold>
</td>
<td align="left">
<bold>Myanmar (Burmese)</bold>
</td>
<td align="left">Wolof</td>
</tr>
<tr>
<td align="left">
<bold>Estonian</bold>
</td>
<td align="left">Nahuatl (Tetelcingo)</td>
<td align="left">
<bold>Xhosa</bold>
</td>
</tr>
<tr>
<td align="left">Ewe</td>
<td align="left">
<bold>Nepali</bold>
</td>
<td align="left">
<bold>Zarma</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>Farsi</bold>
(Persian)</td>
<td align="left">
<bold>Norwegian</bold>
</td>
<td align="left">Zulu</td>
</tr>
<tr>
<td align="left">
<bold>Finnish</bold>
</td>
<td align="left">Ojibwa</td>
<td align="left"></td>
</tr>
<tr>
<td align="left">
<bold>French</bold>
</td>
<td align="left">Paite (Chin)</td>
<td align="left"></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The languages containing the full Bible text are in bold. Most of the remaining languages contain the New Testament part of the Bible only (languages marked with
<sup></sup>
 contain smaller parts)</p>
</table-wrap-foot>
</table-wrap>
<table-wrap id="Tab2">
<label>Table 2</label>
<caption>
<p>Bible Corpus language information</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left"></th>
<th align="left"># Languages</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Non-Latin script</td>
<td align="left">28</td>
</tr>
<tr>
<td align="left">&lt;1M speakers</td>
<td align="left">39</td>
</tr>
<tr>
<td align="left">Non-Indo-European</td>
<td align="left">66</td>
</tr>
<tr>
<td align="left">Partial Texts</td>
<td align="left">45</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="Fig3">
<label>Fig. 3</label>
<caption>
<p>A map of the distribution of languages in the Bible Corpus. Each pin represents the country or territory where the language originates or is primarily used (e.g. there is only one pin for English in Britain). Location coordinates were acquired from Dryer and Haspelmath (
<xref ref-type="bibr" rid="CR11">2013</xref>
)</p>
</caption>
<graphic xlink:href="10579_2014_9287_Fig3_HTML" id="MO3"></graphic>
</fig>
<fig id="Fig4">
<label>Fig. 4</label>
<caption>
<p>Standardized type-token ratio and average verse length information for each of the languages in the corpus</p>
</caption>
<graphic xlink:href="10579_2014_9287_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p>Table
<xref rid="Tab3" ref-type="table">3</xref>
contains statistics about the average size and variability of the lexicon of the whole corpus: we include total number of tokens,
<xref ref-type="fn" rid="Fn5">5</xref>
standardised type-token ratio (STTR), average verse length and the percentage of the corpus covered by the 1,000 most frequent words. We also present STTR and average verse length information for each individual language in Fig.
<xref rid="Fig4" ref-type="fig">4</xref>
.
<table-wrap id="Tab3">
<label>Table 3</label>
<caption>
<p>Bible Corpus statistics</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">Corpus</th>
<th align="left"># Tokens</th>
<th align="left">STTR (%)</th>
<th align="left">Length</th>
<th align="left">SD</th>
<th align="left">Top-1,000 cover (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Bible-avg</td>
<td char="." align="char">432,691</td>
<td char="." align="char">48.59</td>
<td char="." align="char">23.82</td>
<td char="." align="char">7.46</td>
<td char="." align="char">73.80</td>
</tr>
<tr>
<td align="left">Bible-eng</td>
<td char="." align="char">789,635</td>
<td char="." align="char">34.42</td>
<td char="." align="char">28.35</td>
<td char="." align="char">12.58</td>
<td char="." align="char">88.69</td>
</tr>
<tr>
<td align="left">WSJ</td>
<td char="." align="char">1,173,760</td>
<td char="." align="char">48.89</td>
<td char="." align="char">24.92</td>
<td char="." align="char">12.57</td>
<td char="." align="char">74.11</td>
</tr>
<tr>
<td align="left">1984-novel</td>
<td char="." align="char">122,644</td>
<td char="." align="char">47.56</td>
<td char="." align="char">19.99</td>
<td char="." align="char">15.20</td>
<td char="." align="char">81.89</td>
</tr>
<tr>
<td align="left">CHILDES</td>
<td char="." align="char">366,509</td>
<td char="." align="char">32.17</td>
<td char="." align="char">4.45</td>
<td char="." align="char">3.04</td>
<td char="." align="char">93.60</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>STTR is standardised type-token ratio; Length and SD are the mean and standard deviation of the number of tokens in each verse (or sentence for the other corpora). Bible-avg is the (macro) average over all the languages in the corpus; WSJ is the Wall Street Journal portion of the Penn Treebank (Marcus et al.
<xref ref-type="bibr" rid="CR23">1993</xref>
); George Orwell’s 1984 novel is part of the MULTEXT-East corpus (Erjavec
<xref ref-type="bibr" rid="CR13">2004</xref>
); CHILDES (MacWhinney
<xref ref-type="bibr" rid="CR22">2000</xref>
) is a corpus of child-directed speech utterances</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>In order to normalise over the overall size of each corpus, we computed STTR by calculating a macro-average over successive measurements of the type-token ratio (# unique word types/# all tokens) over a fixed number of tokens. This fixed number corresponded to the smallest number of tokens in any of the languages (678 tokens in Gaelic). We also include the specific numbers for the English translation as well as numbers from other corpora for comparison: the Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al.
<xref ref-type="bibr" rid="CR23">1993</xref>
), George Orwell’s 1984 novel and a corpus of child-directed speech (CHILDES; MacWhinney
<xref ref-type="bibr" rid="CR22">2000</xref>
).</p>
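<p>The STTR computation described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name <monospace>sttr</monospace>, the use of non-overlapping windows, the discarding of a trailing partial window and the assumption that the text is already tokenised are all choices made here for the example.</p>
<preformat preformat-type="code"><![CDATA[
```python
def sttr(tokens, window=678):
    """Standardised type-token ratio, as a percentage.

    Assumption: the text is cut into successive non-overlapping windows
    of `window` tokens, the type-token ratio is measured in each one,
    and the ratios are averaged. A trailing partial window is dropped.
    """
    ratios = []
    for start in range(0, len(tokens) - window + 1, window):
        chunk = tokens[start:start + window]
        # unique word types / all tokens, within this window only
        ratios.append(len(set(chunk)) / window)
    if not ratios:
        raise ValueError("text is shorter than one window")
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    toy = "a b a b c d c d".split()
    print(sttr(toy, window=4))
```
]]></preformat>
<p>Because every window has the same length, the resulting value does not shrink simply because one text is longer than another, which is the point of the normalisation.</p>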
<p>We can see that although the average type-token ratio of the corpus is close to that of both WSJ and 1984, the English translation has far fewer unique word types. Following Resnik et al. (
<xref ref-type="bibr" rid="CR29">1999</xref>
), we also compared the vocabulary of the English translation with that of the other English corpora we had available. As Table
<xref rid="Tab4" ref-type="table">4</xref>
shows, even though the language of the King James Version of the Bible is more archaic than that of the New International Version (used in Resnik et al.’s comparisons), it still covers a significant portion of the most frequent words in all three corpora.
<table-wrap id="Tab4">
<label>Table 4</label>
<caption>
<p>English Bible coverage of the most frequent words in three corpora</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">corpus
<inline-formula id="IEq28">
<alternatives>
<tex-math id="M11">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_{topN}$$\end{document}</tex-math>
<mml:math id="M12">
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>p</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq28.gif"></inline-graphic>
</alternatives>
</inline-formula>
</th>
<th align="left">Coverage</th>
<th align="left">Percentage</th>
<th align="left">Missing words cover (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">WSJ
<inline-formula id="IEq29">
<alternatives>
<tex-math id="M13">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_{500}$$\end{document}</tex-math>
<mml:math id="M14">
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mn>500</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq29.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td char="." align="char">347</td>
<td char="." align="char">69.40</td>
<td char="." align="char">8.03</td>
</tr>
<tr>
<td align="left">1984
<inline-formula id="IEq30">
<alternatives>
<tex-math id="M15">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_{500}$$\end{document}</tex-math>
<mml:math id="M16">
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mn>500</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq30.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td char="." align="char">423</td>
<td char="." align="char">86.40</td>
<td char="." align="char">3.42</td>
</tr>
<tr>
<td align="left">CHILDES
<inline-formula id="IEq31">
<alternatives>
<tex-math id="M17">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_{500}$$\end{document}</tex-math>
<mml:math id="M18">
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mn>500</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq31.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td char="." align="char">401</td>
<td char="." align="char">80.20</td>
<td char="." align="char">6.75</td>
</tr>
<tr>
<td align="left">WSJ
<inline-formula id="IEq32">
<alternatives>
<tex-math id="M19">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_{1000}$$\end{document}</tex-math>
<mml:math id="M20">
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mn>1000</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq32.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td char="." align="char">403</td>
<td char="." align="char">59.70</td>
<td char="." align="char">12.18</td>
</tr>
<tr>
<td align="left">1984
<inline-formula id="IEq33">
<alternatives>
<tex-math id="M21">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_{1000}$$\end{document}</tex-math>
<mml:math id="M22">
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mn>1000</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq33.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td char="." align="char">762</td>
<td char="." align="char">76.20</td>
<td char="." align="char">5.95</td>
</tr>
<tr>
<td align="left">CHILDES
<inline-formula id="IEq34">
<alternatives>
<tex-math id="M23">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$_{1000}$$\end{document}</tex-math>
<mml:math id="M24">
<mml:msub>
<mml:mrow></mml:mrow>
<mml:mn>1000</mml:mn>
</mml:msub>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq34.gif"></inline-graphic>
</alternatives>
</inline-formula>
</td>
<td char="." align="char">709</td>
<td char="." align="char">70.90</td>
<td char="." align="char">8.85</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>For a qualitative view of the omissions, we present the 10 most frequent words of each corpus that are missing from the KJV Bible. From the WSJ corpus, the words are mostly market-related:
<italic>million, Mr, says, billion, Corp, inc, shares, president, Co, sales</italic>
; in the case of 1984 they are words related to the story of the novel:
<italic>winston, party, o’brien, telescreen, big, human, don’t, merely, oceania, minutes</italic>
; and finally from CHILDES they are mostly informal, spoken constructions:
<italic>yeah, does, huh, alright, ya, okay, gonna, mhm, big, baby</italic>
. It is interesting to note that common words like ‘big’ and ‘human’ appear in these lists. Their absence from the KJV is due to stylistic differences as well as actual diachronic language change.</p>
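<p>A coverage comparison of this kind can be sketched as below. The sketch is illustrative only: the function name <monospace>top_n_coverage</monospace> is hypothetical, tokenisation and case handling are left to the caller, and the reading of "missing words cover" as the share of the comparison corpus's tokens accounted for by the missing words is an assumption rather than a definition taken from the text.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def top_n_coverage(reference_tokens, other_tokens, n=500):
    """Check how many of the n most frequent words of `other_tokens`
    also occur anywhere in `reference_tokens` (e.g. the English Bible)."""
    reference_vocab = set(reference_tokens)
    counts = Counter(other_tokens)
    top = [word for word, _ in counts.most_common(n)]
    if not top:
        raise ValueError("comparison corpus is empty")
    missing = [word for word in top if word not in reference_vocab]
    covered = len(top) - len(missing)
    # Assumed meaning of "missing words cover": token mass of the
    # missing words as a percentage of the comparison corpus.
    missing_mass = sum(counts[word] for word in missing)
    return {
        "covered": covered,
        "percentage": 100.0 * covered / len(top),
        "missing_cover": 100.0 * missing_mass / sum(counts.values()),
        "missing_words": missing[:10],  # most frequent missing words first
    }


if __name__ == "__main__":
    bible = "the lord said".split()
    other = "the the the the said said said market market shares".split()
    print(top_n_coverage(bible, other, n=3))
```
]]></preformat>
<p>The <monospace>missing_words</monospace> entry is ordered by frequency in the comparison corpus, so it yields the kind of qualitative list given above.</p>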
</sec>
<sec id="Sec12">
<title>Remaining problems in the parallel corpus</title>
<p>As Table 
<xref rid="Tab2" ref-type="table">2</xref>
shows, 45 out of the 100 languages contain only partial texts. In most cases this means that only the New Testament was available for that language, but in a few cases even less text exists. This means that if we want to use all 100 languages, we are limited to the smallest amount of text contained in any of them.</p>
<p>A further problem is the fact that not all the canonical verses (i.e. verses that appear in the original Greek, Hebrew and Aramaic) are present even in the official translations. One possible explanation is that the missing verses are contained in the verses that come before, or after them. This is a reasonable assumption, since in some languages it might not be easy to follow the sentence structure of the original text (e.g. a sentence that is split across two verses). For instance, in the Turkish text, verses no. 2 and 3 of chapter 7 of the Book of Genesis are combined:
<xref ref-type="fn" rid="Fn6">6</xref>
</p>
<p>
<disp-quote>
<p>GEN.7.2: Yeryüzünde soyları tükenmesin diye, yanına temiz sayılan hayvanlardan erkek ve dişi olmak üzere yedişer çift, kirli sayılan hayvanlardan birer çift, kuşlardan yedişer çift al.</p>
</disp-quote>
<disp-quote>
<p>[Gloss] Extinction on earth, lest next clean counted seven pairs of animals, including male and female, a pair of unclean animals, birds take seven pairs.</p>
</disp-quote>
<disp-quote>
<p>
<inline-formula id="IEq35">
<alternatives>
<tex-math id="M25">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_$$\end{document}</tex-math>
<mml:math id="M26">
<mml:mrow>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq35.gif"></inline-graphic>
</alternatives>
</inline-formula>
</p>
</disp-quote>
<disp-quote>
<p>GEN.7.2: Of every clean beast thou shalt take to thee by sevens, the male and his female: and of beasts that are not clean by two, the male and his female.</p>
</disp-quote>
<disp-quote>
<p>GEN.7.3: Of fowls also of the air by sevens, the male and the female; to keep seed alive upon the face of all the earth.</p>
</disp-quote>
</p>
<p>In fact, the most commonly missing verse in the New Testament is 2 Corinthians 13:14 (missing from 33 languages, compared with a median of 2), which is a known versification difference.
<xref ref-type="fn" rid="Fn7">7</xref>
</p>
<p>Of course, an alternative explanation would be that some verses were completely omitted, either intentionally or unintentionally (see footnote 7). This seems to be the case in the Swedish translation, where verse no. 29 of chapter 28 in the Book of Acts is missing and the text does not appear in either verse no. 28 or 30.</p>
<p>
<disp-quote>
<p>ACT.28.28: Det mån I därför veta: till hedningarna bar denna Guds frälsning blivit sänd; de skola ock akta därpå.</p>
</disp-quote>
<disp-quote>
<p>[Gloss] Be it known therefore know the pagans wore this salvation of God is sent; they will also hearken.</p>
</disp-quote>
<disp-quote>
<p>ACT.28.30: I två hela år bodde han sedan kvar i en bostad som han själv hade hyrt. Och alla som kommo till honom tog han emot;</p>
</disp-quote>
<disp-quote>
<p>[Gloss] For two whole years he lived then left in a residence that he had rented. And everyone who came to him, he received;</p>
</disp-quote>
<disp-quote>
<p>
<inline-formula id="IEq36">
<alternatives>
<tex-math id="M27">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_$$\end{document}</tex-math>
<mml:math id="M28">
<mml:mrow>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
<mml:mi>_</mml:mi>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq36.gif"></inline-graphic>
</alternatives>
</inline-formula>
</p>
</disp-quote>
<disp-quote>
<p>ACT.28.28: Be it known therefore unto you, that the salvation of God is sent unto the Gentiles, and that they will hear it.</p>
</disp-quote>
<disp-quote>
<p>ACT.28.29: And when he had said these words, the Jews departed, and had great reasoning among themselves.</p>
</disp-quote>
<disp-quote>
<p>ACT.28.30: And Paul dwelt two whole years in his own hired house, and received all that came in unto him,</p>
</disp-quote>
</p>
<p>There are also cases where multiple verses are omitted. For instance, in the Marathi translation, the first verse of the first chapter of the Book of Ezekiel is verse no. 5, with no information about the previous four verses. However, neither the single nor the continuous omissions are very frequent. When we examine all the translations that contain the full text, there are on average 19.38 single-verse omissions and 9.69 continuous ones. This amounts to 0.06% and 0.03% of the total number of verses, respectively.</p>
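<p>Single and continuous omissions of this kind can be separated with a short routine such as the one below. It is a sketch under stated assumptions: verse identifiers are taken to follow the <monospace>BOOK.chapter.verse</monospace> pattern seen in the examples above, a canonical ordered list of identifiers is assumed to be available, and the function name <monospace>omission_runs</monospace> is hypothetical. Whether the reported averages count runs or individual verses is not stated, so the sketch simply returns the runs.</p>
<preformat preformat-type="code"><![CDATA[
```python
from itertools import groupby


def omission_runs(canonical_ids, present_ids):
    """Split the verses missing from a translation into single
    omissions and runs of consecutive omissions.

    canonical_ids: verse IDs in canonical order (assumed to be given)
    present_ids:   verse IDs that the translation actually contains
    """
    present = set(present_ids)
    runs = []
    # groupby yields maximal stretches of consecutive present/missing IDs.
    # Note: a run is not split at book or chapter boundaries here.
    for is_missing, group in groupby(canonical_ids,
                                     key=lambda vid: vid not in present):
        if is_missing:
            runs.append(list(group))
    singles = [run for run in runs if len(run) == 1]
    continuous = [run for run in runs if len(run) > 1]
    return singles, continuous


if __name__ == "__main__":
    canonical = ["GEN.7.1", "GEN.7.2", "GEN.7.3", "GEN.7.4",
                 "ACT.28.28", "ACT.28.29", "ACT.28.30"]
    present = ["GEN.7.1", "GEN.7.4", "ACT.28.28", "ACT.28.30"]
    print(omission_runs(canonical, present))
```
]]></preformat>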
<p>One way to deal with these omissions would be to drop, from all languages, any verse whose text is missing in even one of the languages in the corpus.
<xref ref-type="fn" rid="Fn8">8</xref>
Even with this drastic strategy, the overall loss of text across languages may be found to be tolerable: on average, each full Bible translation contains about 643,000 words; after the elimination of all non-shared verses, we found the average word count to be about 549,000, only a 14.7% reduction.</p>
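<p>The shared-verse strategy amounts to intersecting the verse identifiers of the chosen languages. The following is a minimal sketch, assuming each translation has already been loaded into a dictionary from verse ID to text; the names <monospace>shared_verses</monospace> and <monospace>word_count</monospace>, the toy data and the whitespace-based word count are illustrative assumptions, not part of the released code.</p>
<preformat preformat-type="code"><![CDATA[
```python
def shared_verses(corpus):
    """Keep only the verses that have text in every language.

    corpus: dict mapping a language code to a dict of verse ID -> text.
    A verse with empty or whitespace-only text is treated as missing.
    """
    if not corpus:
        return {}
    non_empty = [
        {vid for vid, text in verses.items() if text.strip()}
        for verses in corpus.values()
    ]
    common = set.intersection(*non_empty)
    return {
        lang: {vid: text for vid, text in verses.items() if vid in common}
        for lang, verses in corpus.items()
    }


def word_count(verses):
    """Whitespace-delimited word count of one translation (an assumption)."""
    return sum(len(text.split()) for text in verses.values())


if __name__ == "__main__":
    toy = {
        "eng": {"ACT.28.28": "a b", "ACT.28.29": "c d e", "ACT.28.30": "f"},
        "swe": {"ACT.28.28": "x y", "ACT.28.30": "z"},
    }
    kept = shared_verses(toy)
    print(sorted(kept["eng"]), word_count(kept["eng"]))
```
]]></preformat>
<p>Applying the same function to a smaller subset of languages removes fewer verses, since the intersection is taken only over the translations actually passed in.</p>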
<p>It should be noted, however, that the corpus presented here contains all available verses in all the languages (each with a unique ID as shown in Fig.
<xref rid="Fig2" ref-type="fig">2</xref>
), meaning that, depending on which subset of languages is chosen, the limitations described above might not apply. Researchers are encouraged to choose their own methods to deal with these occasional unilateral omissions, whose detection is a precondition to finer-grained sentence- and word-level alignment of the kind proposed by Abney and Bird (
<xref ref-type="bibr" rid="CR1">2010</xref>
,
<xref ref-type="bibr" rid="CR2">2011</xref>
).</p>
</sec>
<sec id="Sec13" sec-type="conclusions">
<title>Conclusion</title>
<p>This paper described the creation of a massively parallel corpus, consisting of translations of the Bible in 100 languages. We discussed some of the problems arising from the nature of the texts as well as the process of gathering and annotating the online material. The texts in each language were aligned up to the level of verse in compliance with the CES guidelines. While a few more Bible translations exist in a machine-readable form (as well as a number of different translations for some languages), we believe this set of 100 languages is sufficiently large for an initial release. We expect to add more languages if the resource is used, and we encourage such additions by other researchers. We have released code that allows users to add more languages to the corpus and to process the existing ones; the code and the annotated XML files are published under a Creative Commons license and can be found at the following address:
<ext-link ext-link-type="uri" xlink:href="http://groups.inf.ed.ac.uk/ccg/corpora.html">http://groups.inf.ed.ac.uk/ccg/corpora.html</ext-link>
.</p>
</sec>
</body>
<back>
<app-group>
<app id="App1">
<sec id="Sec14">
<title>Appendix A</title>
<p>See Table
<xref rid="Tab5" ref-type="table">5</xref>
.
<table-wrap id="Tab5">
<label>Table 5</label>
<caption>
<p>Linguistic details and available parts of the Bible corpus with approximate translation date (linguistic data and translation dates from Lewis et al.
<xref ref-type="bibr" rid="CR20">2014</xref>
)</p>
</caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left">ISO 639-3</th>
<th align="left">Language</th>
<th align="left">Family</th>
<th align="left">Genus</th>
<th align="left">Subgenus</th>
<th align="left">Speakers</th>
<th align="left">Script</th>
<th align="left">Full</th>
<th align="left">Parts</th>
<th align="left">Year</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">acu</td>
<td align="left">Achuar-Shiwiar</td>
<td align="left">Jivaroan</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">5,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1981</td>
</tr>
<tr>
<td align="left">afr</td>
<td align="left">Afrikaans</td>
<td align="left">Indo-European</td>
<td align="left">Germanic</td>
<td align="left">West</td>
<td align="left">5,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1953</td>
</tr>
<tr>
<td align="left">agr</td>
<td align="left">Aguaruna</td>
<td align="left">Jivaroan</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">38,300</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1973</td>
</tr>
<tr>
<td align="left">ake</td>
<td align="left">Akawaio</td>
<td align="left">Carib</td>
<td align="left">Northern</td>
<td align="left">East-West Guiana</td>
<td align="left">4,500</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2010</td>
</tr>
<tr>
<td align="left">als</td>
<td align="left">Albanian</td>
<td align="left">Indo-European</td>
<td align="left">Albanian</td>
<td align="left">Tosk</td>
<td align="left">3,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1993</td>
</tr>
<tr>
<td align="left">amh</td>
<td align="left">Amharic</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Semitic</td>
<td align="left">South</td>
<td align="left">17,500,000</td>
<td align="left">Ethiopic</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1840</td>
</tr>
<tr>
<td align="left">amu</td>
<td align="left">Amuzgo</td>
<td align="left">Oto-Manguean</td>
<td align="left">Amuzgoan</td>
<td align="left"></td>
<td align="left">23,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1973</td>
</tr>
<tr>
<td align="left">arb</td>
<td align="left">Arabic</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Semitic</td>
<td align="left">Central</td>
<td align="left">206,000,000</td>
<td align="left">Arabic</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1865</td>
</tr>
<tr>
<td align="left">hye</td>
<td align="left">Armenian</td>
<td align="left">Indo-European</td>
<td align="left">Armenian</td>
<td align="left"></td>
<td align="left">6,400,000</td>
<td align="left">Armenian</td>
<td align="left">N</td>
<td align="left">Parts</td>
<td char="." align="char">1883</td>
</tr>
<tr>
<td align="left">djk</td>
<td align="left">Aukan</td>
<td align="left">Creole</td>
<td align="left">English based</td>
<td align="left">Atlantic</td>
<td align="left">15,500</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1999</td>
</tr>
<tr>
<td align="left">bsn</td>
<td align="left">Barasana-Eduria</td>
<td align="left">Tucanoan</td>
<td align="left">Eastern Tucanoan</td>
<td align="left">Central</td>
<td align="left">1,890</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2001</td>
</tr>
<tr>
<td align="left">eus</td>
<td align="left">Basque</td>
<td align="left">Basque</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">700,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1855</td>
</tr>
<tr>
<td align="left">bul</td>
<td align="left">Bulgarian</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">South</td>
<td align="left">9,000,000</td>
<td align="left">Cyrillic</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1864</td>
</tr>
<tr>
<td align="left">cjp</td>
<td align="left">Cabécar</td>
<td align="left">Chibchan</td>
<td align="left">Talamanca</td>
<td align="left"></td>
<td align="left">8,840</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1993</td>
</tr>
<tr>
<td align="left">cak</td>
<td align="left">Cakchiquel</td>
<td align="left">Mayan</td>
<td align="left">Quichean</td>
<td align="left">Greater Quichean</td>
<td align="left">132,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1931</td>
</tr>
<tr>
<td align="left">cni</td>
<td align="left">Campa (Asháninka)</td>
<td align="left">Arawakan</td>
<td align="left">Maipuran</td>
<td align="left">Southern Maipuran</td>
<td align="left">26,100</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1972</td>
</tr>
<tr>
<td align="left">kbh</td>
<td align="left">Camsá</td>
<td align="left">Equatorial (?)</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">4,770</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1990</td>
</tr>
<tr>
<td align="left">ceb</td>
<td align="left">Cebuano</td>
<td align="left">Austronesian</td>
<td align="left">Malayo-Polynesian</td>
<td align="left">Philippine</td>
<td align="left">15,800,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1917</td>
</tr>
<tr>
<td align="left">cha</td>
<td align="left">Chamorro</td>
<td align="left">Austronesian</td>
<td align="left">Malayo-Polynesian</td>
<td align="left">Chamorro</td>
<td align="left">92,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">Parts</td>
<td char="." align="char">2007</td>
</tr>
<tr>
<td align="left">chr</td>
<td align="left">Cherokee</td>
<td align="left">Iroquoian</td>
<td align="left">Southern Iroquoian</td>
<td align="left"></td>
<td align="left">16,400</td>
<td align="left">Cherokee</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1850</td>
</tr>
<tr>
<td align="left">chq</td>
<td align="left">Chinantec (Quiotepec)</td>
<td align="left">Oto-Manguean</td>
<td align="left">Chinantecan</td>
<td align="left"></td>
<td align="left">8,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1983</td>
</tr>
<tr>
<td align="left">cmn</td>
<td align="left">Chinese</td>
<td align="left">Sino-Tibetan</td>
<td align="left">Sinitic</td>
<td align="left">Chinese</td>
<td align="left">840,000,000</td>
<td align="left">Chinese</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1874</td>
</tr>
<tr>
<td align="left">cop</td>
<td align="left">Coptic</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Egyptian</td>
<td align="left"></td>
<td align="left">Extinct</td>
<td align="left">Coptic</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1716</td>
</tr>
<tr>
<td align="left">hrv</td>
<td align="left">Croatian</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">South</td>
<td align="left">5,500,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1831</td>
</tr>
<tr>
<td align="left">ces</td>
<td align="left">Czech</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">West</td>
<td align="left">9,500,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1380</td>
</tr>
<tr>
<td align="left">dan</td>
<td align="left">Danish</td>
<td align="left">Indo-European</td>
<td align="left">Germanic</td>
<td align="left">North</td>
<td align="left">5,500,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1550</td>
</tr>
<tr>
<td align="left">dik</td>
<td align="left">Dinka</td>
<td align="left">Nilo-Saharan</td>
<td align="left">Eastern Sudanic</td>
<td align="left">Nilotic</td>
<td align="left">450,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2006</td>
</tr>
<tr>
<td align="left">eng</td>
<td align="left">English</td>
<td align="left">Indo-European</td>
<td align="left">Germanic</td>
<td align="left">West</td>
<td align="left">328,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1611</td>
</tr>
<tr>
<td align="left">epo</td>
<td align="left">Esperanto</td>
<td align="left">Constructed</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">1,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1900</td>
</tr>
<tr>
<td align="left">est</td>
<td align="left">Estonian</td>
<td align="left">Uralic</td>
<td align="left">Finno-Ugric</td>
<td align="left">Finno-Permic</td>
<td align="left">1,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1739</td>
</tr>
<tr>
<td align="left">ewe</td>
<td align="left">Ewe</td>
<td align="left">Niger-Congo</td>
<td align="left">Atlantic-Congo</td>
<td align="left">Volta-Congo</td>
<td align="left">2,250,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1911</td>
</tr>
<tr>
<td align="left">pes</td>
<td align="left">Farsi (Persian)</td>
<td align="left">Indo-European</td>
<td align="left">Indo-Iranian</td>
<td align="left">Iranian</td>
<td align="left">22,000,000</td>
<td align="left">Arabic</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1838</td>
</tr>
<tr>
<td align="left">fin</td>
<td align="left">Finnish</td>
<td align="left">Uralic</td>
<td align="left">Finno-Ugric</td>
<td align="left">Finno-Permic</td>
<td align="left">5,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1776</td>
</tr>
<tr>
<td align="left">fra</td>
<td align="left">French</td>
<td align="left">Indo-European</td>
<td align="left">Italic</td>
<td align="left">Romance</td>
<td align="left">58,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1776</td>
</tr>
<tr>
<td align="left">gla</td>
<td align="left">Gaelic (Scottish)</td>
<td align="left">Indo-European</td>
<td align="left">Celtic</td>
<td align="left">Insular</td>
<td align="left">67,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">Parts</td>
<td char="." align="char">1801</td>
</tr>
<tr>
<td align="left">gbi</td>
<td align="left">Galela</td>
<td align="left">West Papuan</td>
<td align="left">North Halmahera</td>
<td align="left">Galela-Loloda</td>
<td align="left">79,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2002</td>
</tr>
<tr>
<td align="left">deu</td>
<td align="left">German</td>
<td align="left">Indo-European</td>
<td align="left">Germanic</td>
<td align="left">West</td>
<td align="left">90,300,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1545</td>
</tr>
<tr>
<td align="left">ell</td>
<td align="left">Greek</td>
<td align="left">Indo-European</td>
<td align="left">Greek</td>
<td align="left">Attic</td>
<td align="left">13,000,000</td>
<td align="left">Greek</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1840</td>
</tr>
<tr>
<td align="left">guj</td>
<td align="left">Gujarati</td>
<td align="left">Indo-European</td>
<td align="left">Indo-Iranian</td>
<td align="left">Indo-Aryan</td>
<td align="left">45,500,000</td>
<td align="left">Gujarati</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1823</td>
</tr>
<tr>
<td align="left">hat</td>
<td align="left">Haitian Creole</td>
<td align="left">Creole</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">7,700,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1985</td>
</tr>
<tr>
<td align="left">heb</td>
<td align="left">Hebrew</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Semitic</td>
<td align="left">Central</td>
<td align="left">5,300,000</td>
<td align="left">Hebrew</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1599</td>
</tr>
<tr>
<td align="left">hin</td>
<td align="left">Hindi</td>
<td align="left">Indo-European</td>
<td align="left">Indo-Iranian</td>
<td align="left">Indo-Aryan</td>
<td align="left">180,000,000</td>
<td align="left">Devanagari</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1818</td>
</tr>
<tr>
<td align="left">hun</td>
<td align="left">Hungarian</td>
<td align="left">Uralic</td>
<td align="left">Finno-Ugric</td>
<td align="left">Ugric</td>
<td align="left">12,500,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1590</td>
</tr>
<tr>
<td align="left">isl</td>
<td align="left">Icelandic</td>
<td align="left">Indo-European</td>
<td align="left">Germanic</td>
<td align="left">North</td>
<td align="left">230,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1863</td>
</tr>
<tr>
<td align="left">ind</td>
<td align="left">Indonesian</td>
<td align="left">Austronesian</td>
<td align="left">Malayo-Polynesian</td>
<td align="left">Malayo-Sumbawan</td>
<td align="left">23,100,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1974</td>
</tr>
<tr>
<td align="left">ita</td>
<td align="left">Italian</td>
<td align="left">Indo-European</td>
<td align="left">Italic</td>
<td align="left">Romance</td>
<td align="left">61,700,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1649</td>
</tr>
<tr>
<td align="left">jai</td>
<td align="left">Jakalteko</td>
<td align="left">Mayan</td>
<td align="left">Kanjobalan-Chujean</td>
<td align="left">Kanjobalan</td>
<td align="left">77,700</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1979</td>
</tr>
<tr>
<td align="left">jpn</td>
<td align="left">Japanese</td>
<td align="left">Japonic</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">122,000,000</td>
<td align="left">Kanji</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1883</td>
</tr>
<tr>
<td align="left">quc</td>
<td align="left">K’iche’</td>
<td align="left">Mayan</td>
<td align="left">Quichean-Mamean</td>
<td align="left">Greater Quichean</td>
<td align="left">1,900,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1995</td>
</tr>
<tr>
<td align="left">kab</td>
<td align="left">Kabyle</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Berber</td>
<td align="left">Northern</td>
<td align="left">3,100,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2011</td>
</tr>
<tr>
<td align="left">kan</td>
<td align="left">Kannada</td>
<td align="left">Dravidian</td>
<td align="left">Southern</td>
<td align="left">Tamil-Kannada</td>
<td align="left">35,300,000</td>
<td align="left">Kannada</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1831</td>
</tr>
<tr>
<td align="left">kor</td>
<td align="left">Korean</td>
<td align="left">Altaic(?)</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">66,300,000</td>
<td align="left">Hangul</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1911</td>
</tr>
<tr>
<td align="left">lat</td>
<td align="left">Latin</td>
<td align="left">Indo-European</td>
<td align="left">Italic</td>
<td align="left">Latino-Faliscan</td>
<td align="left">Extinct</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">400</td>
</tr>
<tr>
<td align="left">lav</td>
<td align="left">Latvian</td>
<td align="left">Indo-European</td>
<td align="left">Baltic</td>
<td align="left">Eastern</td>
<td align="left">1,500,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1689</td>
</tr>
<tr>
<td align="left">lit</td>
<td align="left">Lithuanian</td>
<td align="left">Indo-European</td>
<td align="left">Baltic</td>
<td align="left">Eastern</td>
<td align="left">3,100,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1735</td>
</tr>
<tr>
<td align="left">dop</td>
<td align="left">Lukpa</td>
<td align="left">Niger-Congo</td>
<td align="left">Atlantic-Congo</td>
<td align="left">Volta-Congo</td>
<td align="left">50,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2009</td>
</tr>
<tr>
<td align="left">plt</td>
<td align="left">Malagasy</td>
<td align="left">Austronesian</td>
<td align="left">Malayo-Polynesian</td>
<td align="left">Greater Barito</td>
<td align="left">7,520,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1835</td>
</tr>
<tr>
<td align="left">mal</td>
<td align="left">Malayalam</td>
<td align="left">Dravidian</td>
<td align="left">Southern</td>
<td align="left">Tamil-Kannada</td>
<td align="left">35,400,000</td>
<td align="left">Malayalam</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1841</td>
</tr>
<tr>
<td align="left">mam</td>
<td align="left">Mam</td>
<td align="left">Mayan</td>
<td align="left">Quichean-Mamean</td>
<td align="left">Greater Mamean</td>
<td align="left">200,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1993</td>
</tr>
<tr>
<td align="left">glv</td>
<td align="left">Manx</td>
<td align="left">Indo-European</td>
<td align="left">Celtic</td>
<td align="left">Insular</td>
<td align="left">77,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">Parts</td>
<td char="." align="char">1773</td>
</tr>
<tr>
<td align="left">mri</td>
<td align="left">Maori</td>
<td align="left">Austronesian</td>
<td align="left">Malayo-Polynesian</td>
<td align="left">Central-Eastern</td>
<td align="left">60,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1858</td>
</tr>
<tr>
<td align="left">mar</td>
<td align="left">Marathi</td>
<td align="left">Indo-European</td>
<td align="left">Indo-Iranian</td>
<td align="left">Indo-Aryan</td>
<td align="left">68,000,000</td>
<td align="left">Devanagari</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1821</td>
</tr>
<tr>
<td align="left">mya</td>
<td align="left">Myanmar (Burmese)</td>
<td align="left">Sino-Tibetan</td>
<td align="left">Tibeto-Burman</td>
<td align="left">Lolo-Burmese</td>
<td align="left">32,300,000</td>
<td align="left">Myanmar</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1835</td>
</tr>
<tr>
<td align="left">nhg</td>
<td align="left">Nahuatl (Tetelcingo)</td>
<td align="left">Uto-Aztecan</td>
<td align="left">Southern Uto-Aztecan</td>
<td align="left">Aztecan</td>
<td align="left">3,500</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1980</td>
</tr>
<tr>
<td align="left">nep</td>
<td align="left">Nepali</td>
<td align="left">Indo-European</td>
<td align="left">Indo-Iranian</td>
<td align="left">Indo-Aryan</td>
<td align="left">11,100,000</td>
<td align="left">Devanagari</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1914</td>
</tr>
<tr>
<td align="left">nor</td>
<td align="left">Norwegian</td>
<td align="left">Indo-European</td>
<td align="left">Germanic</td>
<td align="left">North</td>
<td align="left">4,600,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1904</td>
</tr>
<tr>
<td align="left">ojb</td>
<td align="left">Ojibwa</td>
<td align="left">Algic</td>
<td align="left">Algonquian</td>
<td align="left">Central</td>
<td align="left">20,000</td>
<td align="left">Aboriginal Syllabics</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1988</td>
</tr>
<tr>
<td align="left">pck</td>
<td align="left">Paite (Chin)</td>
<td align="left">Sino-Tibetan</td>
<td align="left">Tibeto-Burman</td>
<td align="left">Kuki-Chin-Naga</td>
<td align="left">78,800</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1971</td>
</tr>
<tr>
<td align="left">pol</td>
<td align="left">Polish</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">West</td>
<td align="left">36,600,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1975</td>
</tr>
<tr>
<td align="left">por</td>
<td align="left">Portuguese</td>
<td align="left">Indo-European</td>
<td align="left">Italic</td>
<td align="left">Romance</td>
<td align="left">178,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1751</td>
</tr>
<tr>
<td align="left">pot</td>
<td align="left">Potawatomi</td>
<td align="left">Algic</td>
<td align="left">Algonquian</td>
<td align="left">Central</td>
<td align="left">1,300,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">Parts</td>
<td char="." align="char">1844</td>
</tr>
<tr>
<td align="left">kek</td>
<td align="left">Q’eqchi’</td>
<td align="left">Mayan</td>
<td align="left">Quichean-Mamean</td>
<td align="left">Greater Quichean</td>
<td align="left">400,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1988</td>
</tr>
<tr>
<td align="left">quw</td>
<td align="left">Quichua</td>
<td align="left">Quechuan</td>
<td align="left">Quechua II</td>
<td align="left">B</td>
<td align="left">20,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1972</td>
</tr>
<tr>
<td align="left">rmn</td>
<td align="left">Romani</td>
<td align="left">Indo-European</td>
<td align="left">Indo-Iranian</td>
<td align="left">Indo-Aryan</td>
<td align="left">710,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2008</td>
</tr>
<tr>
<td align="left">ron</td>
<td align="left">Romanian</td>
<td align="left">Indo-European</td>
<td align="left">Italic</td>
<td align="left">Romance</td>
<td align="left">23,400,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1928</td>
</tr>
<tr>
<td align="left">rus</td>
<td align="left">Russian</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">East</td>
<td align="left">143,000,000</td>
<td align="left">Cyrillic</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1876</td>
</tr>
<tr>
<td align="left">srp</td>
<td align="left">Serbian</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">South</td>
<td align="left">7,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1804</td>
</tr>
<tr>
<td align="left">jiv</td>
<td align="left">Shuar (Jivaro)</td>
<td align="left">Jivaroan</td>
<td align="left"></td>
<td align="left"></td>
<td align="left">46,700</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2010</td>
</tr>
<tr>
<td align="left">slk</td>
<td align="left">Slovak</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">West</td>
<td align="left">4,610,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1832</td>
</tr>
<tr>
<td align="left">slv</td>
<td align="left">Slovene</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">South</td>
<td align="left">1,730,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1584</td>
</tr>
<tr>
<td align="left">som</td>
<td align="left">Somali</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Cushitic</td>
<td align="left">East</td>
<td align="left">8,340,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1979</td>
</tr>
<tr>
<td align="left">spa</td>
<td align="left">Spanish</td>
<td align="left">Indo-European</td>
<td align="left">Italic</td>
<td align="left">Romance</td>
<td align="left">328,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1569</td>
</tr>
<tr>
<td align="left">swh</td>
<td align="left">Swahili</td>
<td align="left">Niger-Congo</td>
<td align="left">Atlantic-Congo</td>
<td align="left">Volta-Congo</td>
<td align="left">788,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1891</td>
</tr>
<tr>
<td align="left">swe</td>
<td align="left">Swedish</td>
<td align="left">Indo-European</td>
<td align="left">Germanic</td>
<td align="left">North</td>
<td align="left">8,300,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1917</td>
</tr>
<tr>
<td align="left">arc</td>
<td align="left">Syriac</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Semitic</td>
<td align="left">Central</td>
<td align="left">Extinct</td>
<td align="left">Syriac</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">464</td>
</tr>
<tr>
<td align="left">shi</td>
<td align="left">Tachelhit</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Berber</td>
<td align="left">Northern</td>
<td align="left">3,000,000</td>
<td align="left">Arabic</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">2010</td>
</tr>
<tr>
<td align="left">tgl</td>
<td align="left">Tagalog</td>
<td align="left">Austronesian</td>
<td align="left">Malayo-Polynesian</td>
<td align="left">Philippine</td>
<td align="left">23,900,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1905</td>
</tr>
<tr>
<td align="left">ttq</td>
<td align="left">Tamajaq (Tuareg)</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Berber</td>
<td align="left">Tamasheq</td>
<td align="left">640,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">Parts</td>
<td char="." align="char">1979</td>
</tr>
<tr>
<td align="left">tel</td>
<td align="left">Telugu</td>
<td align="left">Dravidian</td>
<td align="left">South-Central</td>
<td align="left">Telugu</td>
<td align="left">69,600,000</td>
<td align="left">Telugu</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1854</td>
</tr>
<tr>
<td align="left">tha</td>
<td align="left">Thai</td>
<td align="left">Tai-Kadai</td>
<td align="left">Kam-Tai</td>
<td align="left">Be-Tai</td>
<td align="left">20,300,000</td>
<td align="left">Thai</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1883</td>
</tr>
<tr>
<td align="left">tur</td>
<td align="left">Turkish</td>
<td align="left">Altaic</td>
<td align="left">Turkic</td>
<td align="left">Southern</td>
<td align="left">50,000,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1827</td>
</tr>
<tr>
<td align="left">ukr</td>
<td align="left">Ukrainian</td>
<td align="left">Indo-European</td>
<td align="left">Slavic</td>
<td align="left">East</td>
<td align="left">37,000,000</td>
<td align="left">Cyrillic</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1903</td>
</tr>
<tr>
<td align="left">ppk</td>
<td align="left">Uma</td>
<td align="left">Austronesian</td>
<td align="left">Malayo-Polynesian</td>
<td align="left">Celebic</td>
<td align="left">20,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1996</td>
</tr>
<tr>
<td align="left">usp</td>
<td align="left">Uspanteco</td>
<td align="left">Mayan</td>
<td align="left">Quichean-Mamean</td>
<td align="left">Greater Quichean</td>
<td align="left">3,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1999</td>
</tr>
<tr>
<td align="left">vie</td>
<td align="left">Vietnamese</td>
<td align="left">Austro-Asiatic</td>
<td align="left">Mon-Khmer</td>
<td align="left">Viet-Muong</td>
<td align="left">68,600,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1934</td>
</tr>
<tr>
<td align="left">wal</td>
<td align="left">Wolaytta</td>
<td align="left">Afro-Asiatic</td>
<td align="left">Omotic</td>
<td align="left">North</td>
<td align="left">1,230,000</td>
<td align="left">Ethiopic</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1981</td>
</tr>
<tr>
<td align="left">wol</td>
<td align="left">Wolof</td>
<td align="left">Niger-Congo</td>
<td align="left">Atlantic-Congo</td>
<td align="left">Atlantic</td>
<td align="left">4,000,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1988</td>
</tr>
<tr>
<td align="left">xho</td>
<td align="left">Xhosa</td>
<td align="left">Niger-Congo</td>
<td align="left">Atlantic-Congo</td>
<td align="left">Volta-Congo</td>
<td align="left">7,800,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1859</td>
</tr>
<tr>
<td align="left">dje</td>
<td align="left">Zarma</td>
<td align="left">Nilo-Saharan</td>
<td align="left">Songhai</td>
<td align="left">Southern</td>
<td align="left">2,350,000</td>
<td align="left">Latin</td>
<td align="left">Y</td>
<td align="left"></td>
<td char="." align="char">1990</td>
</tr>
<tr>
<td align="left">zul</td>
<td align="left">Zulu</td>
<td align="left">Niger-Congo</td>
<td align="left">Atlantic-Congo</td>
<td align="left">Volta-Congo</td>
<td align="left">9,980,000</td>
<td align="left">Latin</td>
<td align="left">N</td>
<td align="left">NT</td>
<td char="." align="char">1883</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
</sec>
</app>
</app-group>
<fn-group>
<fn id="Fn1">
<label>1</label>
<p>There are other, non-parallel but
<italic>comparable</italic>
corpora that exist in multiple languages (like Wikipedia with more than 10,000 articles in 121 languages) but their use is limited to a few approaches (e.g. Cohen et al. (
<xref ref-type="bibr" rid="CR8">2011</xref>
), mentioned above).</p>
</fn>
<fn id="Fn2">
<label>2</label>
<p>The number of unique language pairs among
<italic>n</italic>
languages irrespective of the order in each pair is
<inline-formula id="IEq2">
<alternatives>
<tex-math id="M29">\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\left( \begin{array}{l}{n}\\ {2}\end{array}\right) } = \frac{n!}{2!(n-2)!}$$\end{document}</tex-math>
<mml:math id="M30">
<mml:mrow>
<mml:mfenced close=")" open="(" separators="">
<mml:mrow>
<mml:mtable columnspacing="0.5ex">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mi>n</mml:mi>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow></mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mfenced>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>!</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>!</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>!</mml:mo>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="10579_2014_9287_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
</p>
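<p>As a quick check of the formula above, the following minimal Python sketch computes the number of unordered pairs for the 100 languages named in the article title. The function names are hypothetical and are not part of the corpus tooling.</p>

```python
from math import comb


def language_pairs(n):
    """Number of unordered language pairs among n languages: n! / (2! (n-2)!)."""
    return comb(n, 2)


def language_pairs_closed_form(n):
    """The same count via the simplified form n(n-1)/2."""
    return n * (n - 1) // 2


print(language_pairs(100))              # 4950
print(language_pairs_closed_form(100))  # 4950
```

<p>Both forms agree because n!/(2!(n-2)!) simplifies to n(n-1)/2.</p>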
</fn>
<fn id="Fn3">
<label>3</label>
<p>We have tried to ensure that all translations we have used here are indeed free for research purposes and will comply with any restrictions that we have inadvertently overlooked.</p>
</fn>
<fn id="Fn4">
<label>4</label>
<p>Thanks to the anonymous reviewer for pointing us to this example.</p>
</fn>
<fn id="Fn5">
<label>5</label>
<p>Calculating the number of tokens is not always straightforward, especially in languages/scripts where words are not tokenised by spaces. In the present study we only perform white-space (and punctuation) tokenisation, which means that the numbers for specific languages (e.g. Japanese, Thai) are going to be misleading.</p>
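<p>The effect described above can be illustrated with a minimal Python sketch of white-space and punctuation tokenisation. The regular expression and the function name are assumptions made for illustration, not necessarily the tokeniser used in the study, and the two sentences are illustrative inputs only.</p>

```python
import re

# Assumed pattern: a run of word characters, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenise(text):
    """Split text on white space and separate punctuation marks."""
    return TOKEN_PATTERN.findall(text)


english = "In the beginning God created the heaven and the earth."
japanese = "はじめに神は天と地とを創造された。"

print(len(tokenise(english)))   # 11: ten words plus the full stop
print(len(tokenise(japanese)))  # 2: one unbroken run plus the full stop
```

<p>Because the Japanese sentence contains no spaces, the whole clause is counted as a single token, which is why token counts for such scripts are misleading under this scheme.</p>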
</fn>
<fn id="Fn6">
<label>6</label>
<p>Glosses are provided using Google Translate.</p>
</fn>
<fn id="Fn7">
<label>7</label>
<p>There are other known omissions of verses whose authenticity has been doubted; see
<ext-link ext-link-type="uri" xlink:href="http://en.wikipedia.org/wiki/List_of_Bible_verses_not_included_in_modern_translations">http://en.wikipedia.org/wiki/List_of_Bible_verses_not_included_in_modern_translations</ext-link>
for a list. We counted the number of languages in which each verse is missing, and the verses listed here are indeed missing in an above-average number of languages: most of them fall at the 60th percentile and some at the 90th percentile (e.g. MAR.9.46, ACT.8.37). Thanks to an anonymous reviewer for pointing us to these cases.</p>
</fn>
<fn id="Fn8">
<label>8</label>
<p>The alternative approach would be a simple heuristic: if a verse is missing in any language, its contents in all the other languages are merged with the previous verse. However, since there is no guarantee that the text is indeed present in the previous (or the next) verse, the quality of the alignment would be compromised.</p>
</fn>
</fn-group>
<ack>
<p>This material is based on research partially supported by DARPA under agreement number FA8750-13-2-0008. The U.S. Government is authorised to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. This research was partially supported by ERC Advanced Fellowship 249520 GRAMPLUS.</p>
</ack>
<ref-list id="Bib1">
<title>References</title>
<ref id="CR1">
<mixed-citation publication-type="other">Abney, S., & Bird, S. (2010). The human language project: Building a universal corpus of the world’s languages. In
<italic> Proceedings of the 48th annual meeting of the association for computational linguistics</italic>
(pp. 88–97). Uppsala: Association for Computational Linguistics.</mixed-citation>
</ref>
<ref id="CR2">
<mixed-citation publication-type="other">Abney, S., & Bird, S. (2011). Towards a data model for the Universal Corpus. In
<italic>Proceedings of the 4th workshop on building and using comparable corpora: Comparable corpora and the web</italic>
. ACL, pp. 120–127.</mixed-citation>
</ref>
<ref id="CR3">
<mixed-citation publication-type="other">Allen, J. D., Anderson, D., Becker, J., Cook, R., Davis, M., Edberg, P., Everson, M., Freytag, A., Jenkins, J. H., McGowan, R., Moore, L., Muller, E., Phillips, A., Suignard, M., & Whistler, K. (Eds.). (2012). The Unicode Standard, version 6.2. Unicode Consortium, Mountain View, CA.</mixed-citation>
</ref>
<ref id="CR4">
<mixed-citation publication-type="other">Brend, R. M., & Pike, K. L. (1977).
<italic>The Summer Institute of Linguistics: Its works and contributions</italic>
. The Hague: Mouton.</mixed-citation>
</ref>
<ref id="CR5">
<mixed-citation publication-type="other">Catholic Church. (2001). Liturgiam authenticam: Fifth instruction on vernacular translation of the Roman liturgy. United States Conference of Catholic Bishops, Washington, DC.</mixed-citation>
</ref>
<ref id="CR6">
<mixed-citation publication-type="other">Čermák, F., & Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics 13(3):411–427,
<ext-link ext-link-type="uri" xlink:href="http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf">http://utkl.ff.cuni.cz/~rosen/public/2012_intercorp_ijcl.pdf</ext-link>
</mixed-citation>
</ref>
<ref id="CR7">
<mixed-citation publication-type="other">Christodoulopoulos, C. (2013). An iterated learning framework for unsupervised part-of-speech induction. PhD thesis, School of Informatics, University of Edinburgh.</mixed-citation>
</ref>
<ref id="CR8">
<mixed-citation publication-type="other">Cohen, S. B., Das, D., & Smith, N. A. (2011). Unsupervised structure prediction with non-parallel multilingual guidance. In
<italic>Proceedings of EMNLP</italic>
, pp. 50–61.</mixed-citation>
</ref>
<ref id="CR9">
<mixed-citation publication-type="other">Das, D., & Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In
<italic>Proceedings of ACL-HLT</italic>
, pp. 600–609.</mixed-citation>
</ref>
<ref id="CR10">
<mixed-citation publication-type="other">Dipper, S., & Schultz-Balluff, S. (2013). The Anselm corpus: Methods and perspectives of a parallel aligned corpus. In
<italic>Proceedings of the workshop on computational historical linguistics at NODALIDA</italic>
, pp. 27–42.</mixed-citation>
</ref>
<ref id="CR11">
<mixed-citation publication-type="other">Dryer, M. S., & Haspelmath, M. (Eds.). (2013). WALS online. Max Planck Institute for Evolutionary Anthropology, Leipzig,
<ext-link ext-link-type="uri" xlink:href="http://wals.info/">http://wals.info/</ext-link>
.</mixed-citation>
</ref>
<ref id="CR12">
<mixed-citation publication-type="other">Ehrmann, M., Turchi, M., & Steinberger, R. (2011). Building a multilingual named entity-annotated corpus using annotation projection. In RANLP, pp. 118–124.</mixed-citation>
</ref>
<ref id="CR13">
<mixed-citation publication-type="other">Erjavec, T. (2004). MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora.
<italic>Proceedings of LREC</italic>
(pp. 1535–1538). Paris, France.</mixed-citation>
</ref>
<ref id="CR14">
<mixed-citation publication-type="other">Francis, W. N. (1964). A standard sample of present-day English for use with digital computers. Tech. rep., Dept. of Linguistics, Brown University, Providence, RI, USA, report to the US Office of Education on Co-operative Research Project no. E-007.</mixed-citation>
</ref>
<ref id="CR15">
<mixed-citation publication-type="other">Germann, U. (2001). Aligned Hansards of the 36th parliament of Canada. Natural Language Group of the USC Information Sciences Institute.</mixed-citation>
</ref>
<ref id="CR16">
<mixed-citation publication-type="other">Ide, N. (1998). Encoding linguistic corpora. In
<italic>Proceedings of the sixth workshop on very large corpora</italic>
, pp. 9–17.</mixed-citation>
</ref>
<ref id="CR17">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kanungo</surname>
<given-names>T</given-names>
</name>
<name>
<surname>Resnik</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Mao</surname>
<given-names>S</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>D</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>Q</given-names>
</name>
</person-group>
<article-title>The Bible and multilingual optical character recognition</article-title>
<source>Communications of the ACM</source>
<year>2005</year>
<volume>48</volume>
<fpage>124</fpage>
<lpage>130</lpage>
<pub-id pub-id-type="doi">10.1145/1064830.1064837</pub-id>
</element-citation>
</ref>
<ref id="CR18">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koehn</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Europarl: A parallel corpus for statistical machine translation</article-title>
<source>MT summit</source>
<year>2005</year>
<volume>5</volume>
<fpage>79</fpage>
<lpage>86</lpage>
</element-citation>
</ref>
<ref id="CR19">
<mixed-citation publication-type="other">Landauer, T. K., & Littman, M. L. (1991). A statistical method for language-independent representation of the topical content of text segments. In: Proceedings of the Eleventh International Conference: Expert Systems and Their Applications, vol. 8, p. 85.</mixed-citation>
</ref>
<ref id="CR21">
<mixed-citation publication-type="other">Lewis, W. (2010). Haitian creole: How to build and ship an MT engine from scratch in 4 days, 17 hours, & 30 minutes. In
<italic>EAMT 2010: Proceedings of the 14th annual conference of the European association for machine translation</italic>
(pp. 8–13). Saint-Raphaël, France.</mixed-citation>
</ref>
<ref id="CR20">
<mixed-citation publication-type="other">Lewis, M. P., Simons, G. F., & Fennig, C. D. (Eds.). (2014).
<italic>Ethnologue: Languages of the world</italic>
(seventeenth ed.). Dallas, TX: SIL International.</mixed-citation>
</ref>
<ref id="CR22">
<mixed-citation publication-type="other">MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk, Volume II: The database. Lawrence Erlbaum.</mixed-citation>
</ref>
<ref id="CR23">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marcus</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Santorini</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Marcinkiewicz</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>Building a large annotated corpus of English: The Penn Treebank</article-title>
<source>Computational Linguistics</source>
<year>1993</year>
<volume>19</volume>
<fpage>313</fpage>
<lpage>330</lpage>
</element-citation>
</ref>
<ref id="CR24">
<mixed-citation publication-type="other">Neubig, G., Matsubayashi, Y., Hagiwara, M., & Murakami, K. (2011). Safety information mining: What can NLP do in a disaster. In: IJCNLP, pp. 965–973.</mixed-citation>
</ref>
<ref id="CR25">
<mixed-citation publication-type="other">Nida, E. A., & Taber, C. R. (1969).
<italic>The theory and practice of translation</italic>
. Helps for Translators, Leiden: EJ Brill.</mixed-citation>
</ref>
<ref id="CR26">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Och</surname>
<given-names>FJ</given-names>
</name>
<name>
<surname>Ney</surname>
<given-names>H</given-names>
</name>
</person-group>
<article-title>A systematic comparison of various statistical alignment models</article-title>
<source>Computational Linguistics</source>
<year>2003</year>
<volume>29</volume>
<fpage>19</fpage>
<lpage>51</lpage>
<pub-id pub-id-type="doi">10.1162/089120103321337421</pub-id>
</element-citation>
</ref>
<ref id="CR27">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Potthast</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Barrón-Cedeño</surname>
<given-names>A</given-names>
</name>
<name>
<surname>Stein</surname>
<given-names>B</given-names>
</name>
<name>
<surname>Rosso</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Cross-language plagiarism detection</article-title>
<source>Language Resources and Evaluation</source>
<year>2011</year>
<volume>45</volume>
<issue>1</issue>
<fpage>45</fpage>
<lpage>62</lpage>
<pub-id pub-id-type="doi">10.1007/s10579-009-9114-z</pub-id>
</element-citation>
</ref>
<ref id="CR28">
<mixed-citation publication-type="other">Proctor, P. (Ed.). (1978).
<italic>Longman dictionary of contemporary English</italic>
. Harlow: Longman Group.</mixed-citation>
</ref>
<ref id="CR29">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Resnik</surname>
<given-names>P</given-names>
</name>
<name>
<surname>Olsen</surname>
<given-names>M</given-names>
</name>
<name>
<surname>Diab</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>The Bible as a parallel corpus: Annotating the “Book of 2000 Tongues”</article-title>
<source>Computers and the Humanities</source>
<year>1999</year>
<volume>33</volume>
<fpage>129</fpage>
<lpage>153</lpage>
<pub-id pub-id-type="doi">10.1023/A:1001798929185</pub-id>
</element-citation>
</ref>
<ref id="CR31">
<mixed-citation publication-type="other">Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., & Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In
<italic>Proceedings of the 5th international conference on language resources and evaluation (LREC’2006)</italic>
, Genoa, Italy.</mixed-citation>
</ref>
<ref id="CR30">
<mixed-citation publication-type="other">Steinberger, J., Lenkova, P., Kabadjov, M. A., Steinberger, R., & Van der Goot, E. (2011). Multilingual entity-centered sentiment analysis evaluated by parallel corpora. In: RANLP, pp. 770–775.</mixed-citation>
</ref>
<ref id="CR32">
<mixed-citation publication-type="other">Steinberger, R., Eisele, A., Klocek, S., Pilos, S., & Schlüter, P. (2013). DGT-TM: A freely available translation memory in 22 languages. arXiv preprint
<ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/1309.5226">arXiv:1309.5226</ext-link>
.</mixed-citation>
</ref>
<ref id="CR33">
<mixed-citation publication-type="other">Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In: Calzolari, N., Choukri, K., Declerck, T., Doğan, M. U., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., & Piperidis, S. (Eds.),
<italic>Proceedings of the eighth international conference on language resources and evaluation (LREC’12)</italic>
, European Language Resources Association (ELRA), Istanbul, Turkey.</mixed-citation>
</ref>
<ref id="CR34">
<mixed-citation publication-type="other">Turchi, M., Steinberger, J., Kabadjov, M., & Steinberger, R. (2010). Using parallel corpora for multilingual (multi-document) summarisation evaluation. In
<italic>Multilingual and multimodal information access evaluation</italic>
(pp. 52–63). Berlin: Springer.</mixed-citation>
</ref>
<ref id="CR35">
<mixed-citation publication-type="other">United Bible Societies (2013) Bible translation.
<ext-link ext-link-type="uri" xlink:href="http://www.unitedbiblesocieties.org/sample-page/bible-translation">http://www.unitedbiblesocieties.org/sample-page/bible-translation</ext-link>
</mixed-citation>
</ref>
<ref id="CR36">
<mixed-citation publication-type="other">Vinokourov, A., Cristianini, N., & Shawe-Taylor, J. S. (2002). Inferring a semantic representation of text via cross-language correlation analysis. In
<italic>Advances in neural information processing systems</italic>
, pp. 1473–1480.</mixed-citation>
</ref>
<ref id="CR37">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wei</surname>
<given-names>CP</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>CC</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>CM</given-names>
</name>
</person-group>
<article-title>A latent semantic indexing-based approach to multilingual document clustering</article-title>
<source>Decision Support Systems</source>
<year>2008</year>
<volume>45</volume>
<issue>3</issue>
<fpage>606</fpage>
<lpage>620</lpage>
<pub-id pub-id-type="doi">10.1016/j.dss.2007.07.008</pub-id>
</element-citation>
</ref>
<ref id="CR38">
<mixed-citation publication-type="other">Wierzbicka, A. (2001). What did Jesus mean? Explaining the Sermon on the Mount and the parables in simple and universal human concepts. Oxford: Oxford University Press.</mixed-citation>
</ref>
<ref id="CR39">
<mixed-citation publication-type="other">Wikipedia (2013) List of literary works by number of translations.
<ext-link ext-link-type="uri" xlink:href="http://en.wikipedia.org/wiki/List_of_literary_works_by_number_of_translations">http://en.wikipedia.org/wiki/List_of_literary_works_by_number_of_translations</ext-link>
</mixed-citation>
</ref>
<ref id="CR40">
<mixed-citation publication-type="other">Yarowsky, D., & Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In
<italic>Proceedings of NAACL</italic>
, pp. 1–8.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Linguistique/explor/TamazightV2/Data/Pmc/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000067  | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000067  | SxmlIndent | more
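
Once a record has been extracted and saved as text, its structured references can be read with any generic XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` module; the function name `list_citations` and the constant `SAMPLE` are hypothetical, and the fragment is copied from entry CR18 of the reference list. Only `element-citation` entries are handled; free-text `mixed-citation` entries are skipped. The complete record may need its namespace prefixes declared and its ampersands escaped before a strict parser accepts it.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the record's reference list (entry CR18).
SAMPLE = """<ref-list>
<ref id="CR18">
<element-citation publication-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koehn</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Europarl: A parallel corpus for statistical machine translation</article-title>
<source>MT summit</source>
<year>2005</year>
<volume>5</volume>
<fpage>79</fpage>
<lpage>86</lpage>
</element-citation>
</ref>
</ref-list>"""


def list_citations(xml_text):
    """Return one dict per <ref> that holds a structured <element-citation>."""
    root = ET.fromstring(xml_text)
    citations = []
    for ref in root.iter("ref"):
        cit = ref.find("element-citation")
        if cit is None:
            # <mixed-citation> entries are free text; skipped in this sketch.
            continue
        citations.append({
            "id": ref.get("id"),
            "authors": [s.text for s in cit.iter("surname")],
            "title": cit.findtext("article-title"),
            "source": cit.findtext("source"),
            "year": cit.findtext("year"),
            "pages": (cit.findtext("fpage"), cit.findtext("lpage")),
        })
    return citations


if __name__ == "__main__":
    for c in list_citations(SAMPLE):
        print(c["id"], ", ".join(c["authors"]), c["year"], c["title"])
```

Returning plain dictionaries keeps the sketch independent of any particular bibliography tool; fields that are absent from an entry (for example a missing `lpage`) come back as `None`.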

To add a link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Linguistique
   |area=    TamazightV2
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}


This area was generated with Dilib version V0.6.33.
Data generation: Wed Nov 15 18:28:35 2017. Site generation: Sat Feb 10 16:46:27 2024