TEI exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Bitext generation through rich markup

Internal identifier: 000029 (PascalFrancis/Checkpoint); previous: 000028; next: 000030

Authors: Arantza Casillas [Spain]; Raquel Martinez [Spain]

Source: Computers and the humanities (ISSN 0010-4817), 2004, vol. 38, no. 3, pp. 223-251

RBID : Francis:524-05-12311

French descriptors

English descriptors

Abstract

This paper reports on a method for exploiting a bitext as the primary linguistic information source for the design of a generation environment for specialized bilingual documentation. The paper discusses such issues as Text Encoding Initiative (TEI), proposals for specialized corpus tagging, text segmentation and alignment of translation units and their allocation into translation memories, Document Type Definition (DTD), abstraction from tagged texts, and DTD deployment for bilingual text generation. The parallel corpus used for experimentation has two main features: 1) It contains bilingual documents from a dedicated domain of legal and administrative publications rich in specialized jargon. 2) It involves two languages, Spanish and Basque, which are typologically very distinct (both lexically and morpho-syntactically). Starting from an annotated bitext we show how Standard Generalized Markup Language (SGML) elements can be recycled to produce complementary language resources. Several translation memory databases are produced. Furthermore, DTDs for source and target documents are derived and put into correspondence. This paper discusses how these resources are automatically generated and applied to an interactive bilingual authoring system.


Affiliations:


Links toward previous steps (curation, corpus...)


Links to Exploration step

Francis:524-05-12311

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Bitext generation through rich markup</title>
<author>
<name sortKey="Casillas, Arantza" sort="Casillas, Arantza" uniqKey="Casillas A" first="Arantza" last="Casillas">Arantza Casillas</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Departamento Electricidad y Electrónica, Facultad de Ciencia y Tecnología, UPV-EHU</s1>
<s2>Madrid</s2>
<s3>ESP</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Espagne</country>
<placeName>
<settlement type="city">Madrid</settlement>
<region nuts="2" type="region">Communauté de Madrid</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Martinez, Raquel" sort="Martinez, Raquel" uniqKey="Martinez R" first="Raquel" last="Martinez">Raquel Martinez</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Departamento Informática, Estadística y Telemática, Universidad Rey Juan Carlos</s1>
<s3>ESP</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Espagne</country>
<wicri:noRegion>Departamento Informática, Estadística y Telemática, Universidad Rey Juan Carlos</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">524-05-12311</idno>
<date when="2004">2004</date>
<idno type="stanalyst">FRANCIS 524-05-12311 INIST</idno>
<idno type="RBID">Francis:524-05-12311</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000038</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000048</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000029</idno>
<idno type="wicri:explorRef" wicri:stream="PascalFrancis" wicri:step="Checkpoint">000029</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Bitext generation through rich markup</title>
<author>
<name sortKey="Casillas, Arantza" sort="Casillas, Arantza" uniqKey="Casillas A" first="Arantza" last="Casillas">Arantza Casillas</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Departamento Electricidad y Electrónica, Facultad de Ciencia y Tecnología, UPV-EHU</s1>
<s2>Madrid</s2>
<s3>ESP</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Espagne</country>
<placeName>
<settlement type="city">Madrid</settlement>
<region nuts="2" type="region">Communauté de Madrid</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Martinez, Raquel" sort="Martinez, Raquel" uniqKey="Martinez R" first="Raquel" last="Martinez">Raquel Martinez</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Departamento Informática, Estadística y Telemática, Universidad Rey Juan Carlos</s1>
<s3>ESP</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Espagne</country>
<wicri:noRegion>Departamento Informática, Estadística y Telemática, Universidad Rey Juan Carlos</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Computers and the humanities</title>
<title level="j" type="abbreviated">Comput. humanit.</title>
<idno type="ISSN">0010-4817</idno>
<imprint>
<date when="2004">2004</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Computers and the humanities</title>
<title level="j" type="abbreviated">Comput. humanit.</title>
<idno type="ISSN">0010-4817</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Alignment</term>
<term>Applied linguistics</term>
<term>Automatic generation</term>
<term>Bilingual text</term>
<term>Computational linguistics</term>
<term>Markup language</term>
<term>Method</term>
<term>Natural language processing</term>
<term>Parallel corpus</term>
<term>TEI</term>
<term>Translation memory</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Linguistique appliquée</term>
<term>Linguistique informatique</term>
<term>Traitement automatique des langues naturelles</term>
<term>Méthode</term>
<term>Alignement</term>
<term>Langage de balisage</term>
<term>TEI</term>
<term>Génération automatique</term>
<term>Corpus parallèle</term>
<term>Espagnol</term>
<term>Basque</term>
<term>Document structuré</term>
<term>Texte bilingue</term>
<term>Mémoire de traduction</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper reports on a method for exploiting a bitext as the primary linguistic information source for the design of a generation environment for specialized bilingual documentation. The paper discusses such issues as Text Encoding Initiative (TEI), proposals for specialized corpus tagging, text segmentation and alignment of translation units and their allocation into translation memories, Document Type Definition (DTD), abstraction from tagged texts, and DTD deployment for bilingual text generation. The parallel corpus used for experimentation has two main features: 1) It contains bilingual documents from a dedicated domain of legal and administrative publications rich in specialized jargon. 2) It involves two languages, Spanish and Basque, which are typologically very distinct (both lexically and morpho-syntactically). Starting from an annotated bitext we show how Standard Generalized Markup Language (SGML) elements can be recycled to produce complementary language resources. Several translation memory databases are produced. Furthermore, DTDs for source and target documents are derived and put into correspondence. This paper discusses how these resources are automatically generated and applied to an interactive bilingual authoring system.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>0010-4817</s0>
</fA01>
<fA02 i1="01">
<s0>COHUAD</s0>
</fA02>
<fA03 i2="1">
<s0>Comput. humanit.</s0>
</fA03>
<fA05>
<s2>38</s2>
</fA05>
<fA06>
<s2>3</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG">
<s1>Bitext generation through rich markup</s1>
</fA08>
<fA11 i1="01" i2="1">
<s1>CASILLAS (Arantza)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>MARTINEZ (Raquel)</s1>
</fA11>
<fA14 i1="01">
<s1>Departamento Electricidad y Electrónica, Facultad de Ciencia y Tecnología, UPV-EHU</s1>
<s2>Madrid</s2>
<s3>ESP</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA14 i1="02">
<s1>Departamento Informática, Estadística y Telemática, Universidad Rey Juan Carlos</s1>
<s3>ESP</s3>
<sZ>2 aut.</sZ>
</fA14>
<fA20>
<s1>223-251</s1>
</fA20>
<fA21>
<s1>2004</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA43 i1="01">
<s1>INIST</s1>
<s2>14902</s2>
<s5>354000120715450010</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2005 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>1 p.3/4</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>524-05-12311</s0>
</fA47>
<fA60>
<s1>P</s1>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>Computers and the humanities</s0>
</fA64>
<fA66 i1="01">
<s0>NLD</s0>
</fA66>
<fA68 i1="01" i2="1" l="FRE">
<s1>Génération d'un bitexte à l'aide d'un marquage riche</s1>
</fA68>
<fC01 i1="01" l="ENG">
<s0>This paper reports on a method for exploiting a bitext as the primary linguistic information source for the design of a generation environment for specialized bilingual documentation. The paper discusses such issues as Text Encoding Initiative (TEI), proposals for specialized corpus tagging, text segmentation and alignment of translation units and their allocation into translation memories, Document Type Definition (DTD), abstraction from tagged texts, and DTD deployment for bilingual text generation. The parallel corpus used for experimentation has two main features: 1) It contains bilingual documents from a dedicated domain of legal and administrative publications rich in specialized jargon. 2) It involves two languages, Spanish and Basque, which are typologically very distinct (both lexically and morpho-syntactically). Starting from an annotated bitext we show how Standard Generalized Markup Language (SGML) elements can be recycled to produce complementary language resources. Several translation memory databases are produced. Furthermore, DTDs for source and target documents are derived and put into correspondence. This paper discusses how these resources are automatically generated and applied to an interactive bilingual authoring system.</s0>
</fC01>
<fC02 i1="01" i2="L">
<s0>52478</s0>
<s1>XV</s1>
</fC02>
<fC02 i1="02" i2="L">
<s0>524</s0>
</fC02>
<fC03 i1="01" i2="L" l="FRE">
<s0>Linguistique appliquée</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="L" l="ENG">
<s0>Applied linguistics</s0>
<s2>NI</s2>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="L" l="FRE">
<s0>Linguistique informatique</s0>
<s2>NI</s2>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="L" l="ENG">
<s0>Computational linguistics</s0>
<s2>NI</s2>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="L" l="FRE">
<s0>Traitement automatique des langues naturelles</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="L" l="ENG">
<s0>Natural language processing</s0>
<s2>NI</s2>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="L" l="FRE">
<s0>Méthode</s0>
<s2>NI</s2>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="L" l="ENG">
<s0>Method</s0>
<s2>NI</s2>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="L" l="FRE">
<s0>Alignement</s0>
<s2>NI</s2>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="L" l="ENG">
<s0>Alignment</s0>
<s2>NI</s2>
<s5>05</s5>
</fC03>
<fC03 i1="06" i2="L" l="FRE">
<s0>Langage de balisage</s0>
<s2>NI</s2>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="L" l="ENG">
<s0>Markup language</s0>
<s2>NI</s2>
<s5>06</s5>
</fC03>
<fC03 i1="07" i2="L" l="FRE">
<s0>TEI</s0>
<s2>NI</s2>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="L" l="ENG">
<s0>TEI</s0>
<s2>NI</s2>
<s5>07</s5>
</fC03>
<fC03 i1="08" i2="L" l="FRE">
<s0>Génération automatique</s0>
<s2>NI</s2>
<s5>08</s5>
</fC03>
<fC03 i1="08" i2="L" l="ENG">
<s0>Automatic generation</s0>
<s2>NI</s2>
<s5>08</s5>
</fC03>
<fC03 i1="09" i2="L" l="FRE">
<s0>Corpus parallèle</s0>
<s2>NI</s2>
<s5>09</s5>
</fC03>
<fC03 i1="09" i2="L" l="ENG">
<s0>Parallel corpus</s0>
<s2>NI</s2>
<s5>09</s5>
</fC03>
<fC03 i1="10" i2="L" l="FRE">
<s0>Espagnol</s0>
<s2>NL</s2>
<s5>16</s5>
</fC03>
<fC03 i1="11" i2="L" l="FRE">
<s0>Basque</s0>
<s2>NL</s2>
<s5>17</s5>
</fC03>
<fC03 i1="12" i2="L" l="FRE">
<s0>Document structuré</s0>
<s4>INC</s4>
<s5>31</s5>
</fC03>
<fC03 i1="13" i2="L" l="FRE">
<s0>Texte bilingue</s0>
<s2>NI</s2>
<s4>CD</s4>
<s5>96</s5>
</fC03>
<fC03 i1="13" i2="L" l="ENG">
<s0>Bilingual text</s0>
<s2>NI</s2>
<s4>CD</s4>
<s5>96</s5>
</fC03>
<fC03 i1="14" i2="L" l="FRE">
<s0>Mémoire de traduction</s0>
<s2>NI</s2>
<s4>CD</s4>
<s5>97</s5>
</fC03>
<fC03 i1="14" i2="L" l="ENG">
<s0>Translation memory</s0>
<s2>NI</s2>
<s4>CD</s4>
<s5>97</s5>
</fC03>
<fN21>
<s1>339</s1>
</fN21>
</pA>
</standard>
</inist>
<affiliations>
<list>
<country>
<li>Espagne</li>
</country>
<region>
<li>Communauté de Madrid</li>
</region>
<settlement>
<li>Madrid</li>
</settlement>
</list>
<tree>
<country name="Espagne">
<region name="Communauté de Madrid">
<name sortKey="Casillas, Arantza" sort="Casillas, Arantza" uniqKey="Casillas A" first="Arantza" last="Casillas">Arantza Casillas</name>
</region>
<name sortKey="Martinez, Raquel" sort="Martinez, Raquel" uniqKey="Martinez R" first="Raquel" last="Martinez">Raquel Martinez</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/PascalFrancis/Checkpoint
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000029 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Checkpoint/biblio.hfd -nk 000029 | SxmlIndent | more
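
Once a record has been extracted with the commands above, its main fields can be read with any XML library. The following is only a minimal sketch in Python, not part of Dilib: the `summarize` function and the `RECORD` string are hypothetical names introduced for illustration. `RECORD` is a trimmed copy of the record displayed on this page, with the `wicri:` and `inist:` prefixed elements and attributes left out, on the assumption that their namespaces are not declared in the extracted output (a strict parser would otherwise reject them).

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical fragment of the record shown on this page.
# Prefixed wicri:/inist: nodes are omitted because their namespaces
# are assumed to be undeclared in the extracted output.
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Bitext generation through rich markup</title>
<author><name first="Arantza" last="Casillas">Arantza Casillas</name></author>
<author><name first="Raquel" last="Martinez">Raquel Martinez</name></author>
</titleStmt>
<publicationStmt>
<idno type="RBID">Francis:524-05-12311</idno>
<date when="2004">2004</date>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""


def summarize(xml_text):
    """Return the title, author names, RBID and year of one record."""
    root = ET.fromstring(xml_text)
    title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    pub_stmt = root.find("./TEI/teiHeader/fileDesc/publicationStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
        # Select the identifier whose type attribute is "RBID".
        "rbid": pub_stmt.findtext("idno[@type='RBID']"),
        "year": pub_stmt.find("date").get("when"),
    }


if __name__ == "__main__":
    print(summarize(RECORD))
```

To apply the same function to the full record, the prefixed nodes would first have to be removed or their namespaces declared on the root element.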

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Ticri
   |area=    TeiVM2
   |flux=    PascalFrancis
   |étape=   Checkpoint
   |type=    RBID
   |clé=     Francis:524-05-12311
   |texte=   Bitext generation through rich markup
}}

Wicri

This area was generated with Dilib version V0.6.31.
Data generation: Mon Oct 30 21:59:18 2017. Site generation: Sun Feb 11 23:16:06 2024