TEI exploration server

Warning: this site is under development!
Warning: this site was generated by automated means from raw corpora.
The information has therefore not been validated.

A formal framework for linguistic annotation

Internal identifier: 000356 (Istex/Corpus); previous: 000355; next: 000357

Authors: Steven Bird; Mark Liberman

Source:

RBID : ISTEX:CE38257115F73D7CD5D71EEDCCC26FBE73C26383

Abstract

'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions – audio, video and/or physiological recordings – or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.
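
As a rough illustration of the annotation graph named in the abstract, the sketch below assumes one simple reading of the idea: nodes that may carry a time offset into the signal, joined by directed arcs that each carry a layer name and a label. The class names, method names, and sample data are hypothetical and are not taken from the paper; the times are invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A graph node; `time` is an optional offset into the signal, in seconds."""
    ident: int
    time: Optional[float] = None


@dataclass(frozen=True)
class Arc:
    """A directed, labeled arc: `kind` names the annotation layer, `label` its content."""
    start: Node
    end: Node
    kind: str
    label: str


@dataclass
class AnnotationGraph:
    """Hypothetical container: a list of arcs with two simple search helpers."""
    arcs: list = field(default_factory=list)

    def add(self, start, end, kind, label):
        arc = Arc(start, end, kind, label)
        self.arcs.append(arc)
        return arc

    def search(self, kind=None, label=None):
        """Return arcs matching the given layer and/or label."""
        return [a for a in self.arcs
                if (kind is None or a.kind == kind)
                and (label is None or a.label == label)]

    def spanning(self, time):
        """Return arcs whose two time-anchored endpoints enclose `time`."""
        return [a for a in self.arcs
                if a.start.time is not None and a.end.time is not None
                and a.start.time <= time <= a.end.time]


# Illustrative data only: the times are invented, not drawn from any corpus.
n0, n1, n2 = Node(0, 0.0), Node(1, 0.25), Node(2, 0.5)
g = AnnotationGraph()
g.add(n0, n2, "word", "she")
g.add(n0, n1, "phone", "sh")
g.add(n1, n2, "phone", "iy")

phones = [a.label for a in g.search(kind="phone")]
at_point_three = [(a.kind, a.label) for a in g.spanning(0.3)]
```

Because word-level and phone-level arcs share the same nodes, several annotation layers can sit over one timeline without committing to any particular file format, which is the kind of format independence the abstract describes.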

Url:
DOI: 10.1016/S0167-6393(00)00068-6

Links to Exploration step

ISTEX:CE38257115F73D7CD5D71EEDCCC26FBE73C26383

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>A formal framework for linguistic annotation</title>
<author>
<name sortKey="Bird, Steven" sort="Bird, Steven" uniqKey="Bird S" first="Steven" last="Bird">Steven Bird</name>
<affiliation>
<mods:affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: sb@ldc.upenn.edu</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Liberman, Mark" sort="Liberman, Mark" uniqKey="Liberman M" first="Mark" last="Liberman">Mark Liberman</name>
<affiliation>
<mods:affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:CE38257115F73D7CD5D71EEDCCC26FBE73C26383</idno>
<date when="2001" year="2001">2001</date>
<idno type="doi">10.1016/S0167-6393(00)00068-6</idno>
<idno type="url">https://api.istex.fr/document/CE38257115F73D7CD5D71EEDCCC26FBE73C26383/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000356</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">A formal framework for linguistic annotation</title>
<author>
<name sortKey="Bird, Steven" sort="Bird, Steven" uniqKey="Bird S" first="Steven" last="Bird">Steven Bird</name>
<affiliation>
<mods:affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: sb@ldc.upenn.edu</mods:affiliation>
</affiliation>
</author>
<author>
<name sortKey="Liberman, Mark" sort="Liberman, Mark" uniqKey="Liberman M" first="Mark" last="Liberman">Mark Liberman</name>
<affiliation>
<mods:affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Speech Communication</title>
<title level="j" type="abbrev">SPECOM</title>
<idno type="ISSN">0167-6393</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="2000">2000</date>
<biblScope unit="volume">33</biblScope>
<biblScope unit="issue">1–2</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="60">60</biblScope>
</imprint>
<idno type="ISSN">0167-6393</idno>
</series>
<idno type="istex">CE38257115F73D7CD5D71EEDCCC26FBE73C26383</idno>
<idno type="DOI">10.1016/S0167-6393(00)00068-6</idno>
<idno type="PII">S0167-6393(00)00068-6</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0167-6393</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">`Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions – audio, video and/or physiological recordings – or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.</div>
</front>
</TEI>
<istex>
<corpusName>elsevier</corpusName>
<author>
<json:item>
<name>Steven Bird</name>
<affiliations>
<json:string>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</json:string>
<json:string>E-mail: sb@ldc.upenn.edu</json:string>
</affiliations>
</json:item>
<json:item>
<name>Mark Liberman</name>
<affiliations>
<json:string>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</json:string>
</affiliations>
</json:item>
</author>
<subject>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Speech markup</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Speech corpus</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>General-purpose architecture</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Directed graph</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Phonological representation</value>
</json:item>
</subject>
<language>
<json:string>eng</json:string>
</language>
<originalGenre>
<json:string>Full-length article</json:string>
</originalGenre>
<abstract>`Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions – audio, video and/or physiological recordings – or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.</abstract>
<qualityIndicators>
<score>6.92</score>
<pdfVersion>1.2</pdfVersion>
<pdfPageSize>544 x 743 pts</pdfPageSize>
<refBibsNative>true</refBibsNative>
<keywordCount>5</keywordCount>
<abstractCharCount>1117</abstractCharCount>
<pdfWordCount>17491</pdfWordCount>
<pdfCharCount>103840</pdfCharCount>
<pdfPageCount>38</pdfPageCount>
<abstractWordCount>160</abstractWordCount>
</qualityIndicators>
<title>A formal framework for linguistic annotation</title>
<pii>
<json:string>S0167-6393(00)00068-6</json:string>
</pii>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<volume>33</volume>
<pii>
<json:string>S0167-6393(00)X0050-7</json:string>
</pii>
<editor>
<json:item>
<name>S. Bird and J. Harrington</name>
</json:item>
</editor>
<pages>
<last>60</last>
<first>23</first>
</pages>
<conference>
<json:item>
<name>Speech Annotation and Corpus Tools Speech Annotation</name>
</json:item>
</conference>
<issn>
<json:string>0167-6393</json:string>
</issn>
<issue>1–2</issue>
<genre>
<json:string>journal</json:string>
</genre>
<language>
<json:string>unknown</json:string>
</language>
<title>Speech Communication</title>
<publicationDate>2001</publicationDate>
</host>
<categories>
<wos>
<json:string>COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS</json:string>
<json:string>ACOUSTICS</json:string>
</wos>
</categories>
<publicationDate>2001</publicationDate>
<copyrightDate>2001</copyrightDate>
<doi>
<json:string>10.1016/S0167-6393(00)00068-6</json:string>
</doi>
<id>CE38257115F73D7CD5D71EEDCCC26FBE73C26383</id>
<score>0.15955499</score>
<fulltext>
<json:item>
<original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/CE38257115F73D7CD5D71EEDCCC26FBE73C26383/fulltext/pdf</uri>
</json:item>
<json:item>
<original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/CE38257115F73D7CD5D71EEDCCC26FBE73C26383/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/CE38257115F73D7CD5D71EEDCCC26FBE73C26383/fulltext/tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a">A formal framework for linguistic annotation</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher>ELSEVIER</publisher>
<availability>
<p>ELSEVIER</p>
</availability>
<date>2001</date>
</publicationStmt>
<notesStmt>
<note>Invited paper.</note>
<note type="content">Fig. 1: The two and three-level architectures for speech annotation.</note>
<note type="content">Fig. 2: TIMIT annotation data and graph structure.</note>
<note type="content">Fig. 3: BAS partitur annotation data and graph structure.</note>
<note type="content">Fig. 4: CHILDES annotation data and graph structure.</note>
<note type="content">Fig. 5: LACITO annotation data and graph structure.</note>
<note type="content">Fig. 6: LDC telephone speech data and graph structure.</note>
<note type="content">Fig. 7: UTF annotation data and graph structure.</note>
<note type="content">Fig. 8: Multiple annotations of the Switchboard corpus, with annotation graph.</note>
<note type="content">Fig. 9: Annotation graph for coreference example.</note>
<note type="content">Fig. 10: Gestural score for the phrase 'ten pin'.</note>
<note type="content">Fig. 11: Possible structures for a single layer.</note>
<note type="content">Fig. 12: Inter-arc linkages modeled using equivalence classes.</note>
<note type="content">Fig. 13: Sentence from Carmina 1.5 (Horace) showing dependency structure, with two annotation graphs.</note>
<note type="content">Fig. 14: Visualization for BU example.</note>
</notesStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a">A formal framework for linguistic annotation</title>
<author>
<persName>
<forename type="first">Steven</forename>
<surname>Bird</surname>
</persName>
<email>sb@ldc.upenn.edu</email>
<note type="correspondence">
<p>Corresponding author. Tel.: +1-215-898-0464; fax: +1-215-573-2175</p>
</note>
<affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</affiliation>
</author>
<author>
<persName>
<forename type="first">Mark</forename>
<surname>Liberman</surname>
</persName>
<affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</affiliation>
</author>
</analytic>
<monogr>
<title level="j">Speech Communication</title>
<title level="j" type="abbrev">SPECOM</title>
<idno type="pISSN">0167-6393</idno>
<idno type="PII">S0167-6393(00)X0050-7</idno>
<meeting>
<addName>Speech Annotation and Corpus Tools</addName>
<addName>Speech Annotation</addName>
</meeting>
<editor>
<persName>S. Bird and J. Harrington</persName>
</editor>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="2000"></date>
<biblScope unit="volume">33</biblScope>
<biblScope unit="issue">1–2</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="60">60</biblScope>
</imprint>
</monogr>
<idno type="istex">CE38257115F73D7CD5D71EEDCCC26FBE73C26383</idno>
<idno type="DOI">10.1016/S0167-6393(00)00068-6</idno>
<idno type="PII">S0167-6393(00)00068-6</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2001</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract xml:lang="en">
<p>`Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions – audio, video and/or physiological recordings – or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.</p>
</abstract>
<abstract xml:lang="de">
<p>Der Begriff `Linguistische Annotation' bezeichnet alle Arten deskriptiver oder analytischer Beschreibung von Sprachdaten. Die Ausgangsdaten können dabei entweder die Form von Zeitfunktionen haben – also z.B. Audio, Video und/oder physiologische Signale – oder als Text vorliegen. Die Annotation dagegen kann folgende Inhalte haben: alle Arten von Transkriptionen (von phonetischen Merkmalen bis zu Dialog-Strukturen), Phrasen – oder Inhalts-Segmentierung, syntaktische Analysen, Identifikation von `named entities', Querverweise innerhalb der Annotation, usw. Zwar stehen zur Zeit mehrere verschiedene Formate und Werkzeuge zur linguistischen Annotation zur Verfügung, andererseits entwickelt sich das Fehlen eines allgemein akzeptierten Standards zu einem ernsten Problem. Bisher vorgeschlagene Standards konzentrieren sich auf die Datenformate. Dieser Beitrag dagegen konzentriert sich auf die logische Struktur linguistischer Annotationen. Wir untersuchen eine breite Auswahl existierender Formate und können zeigen, daß diesen ein gemeinsames Konzept zugrundeliegt. Dieses bildet die Grundlage für einen algebraischen Formalismus zur linguistischen Annotation, während gleichzeitig die Konsistenz zu vielen alternativen Datenstrukturen und Datenformaten erhalten bleibt.</p>
</abstract>
<abstract xml:lang="fr">
<p>Par `annotation linguistique' nous désignons toute notation descriptive ou analytique appliquée à des données langagières brutes. Ces données brutes peuvent être des signaux temporels – enregistrements audio, vidéo et/ou physiologiques – ou du texte. Les notations ajoutées peuvent être des transcriptions de toute nature (des traits phonétiques aux structures du discours), des catégories grammaticales ou sémantiques, une analyse syntaxique, l'identification d' `entités nommées', l'annotation de coréférences, etc. Malgré les efforts entrepris pour créer des formats et des outils adaptés à de telles annotations et pour diffuser des bases de données linguistiques annotées, le manque de standards largement acceptés devient un problème critique. Les standards proposés, lorsqu'ils existent, se concentrent sur les formats de fichiers. Cet article se concentre au contraire sur la structure logique des annotations linguistiques. Nous passons en revue une grande variété de formats d'annotations existants et en dégageons une structure conceptuelle commune, le graphe d'annotation. Ceci fournit un cadre formel pour construire des annotations linguistiques, les tenir à jour et y effectuer des requètes, tout en restant cohérent avec de nombreux autres structures de données et formats de fichiers.</p>
</abstract>
<textClass>
<keywords scheme="keyword">
<list>
<head>Keywords</head>
<item>
<term>Speech markup</term>
</item>
<item>
<term>Speech corpus</term>
</item>
<item>
<term>General-purpose architecture</term>
</item>
<item>
<term>Directed graph</term>
</item>
<item>
<term>Phonological representation</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change when="2000-08-02">Registration</change>
<change when="2000">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<original>false</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/CE38257115F73D7CD5D71EEDCCC26FBE73C26383/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="Elsevier, elements deleted: ce:floats; body; tail">
<istex:xmlDeclaration>version="1.0" encoding="utf-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//ES//DTD journal article DTD version 4.5.2//EN//XML" URI="art452.dtd" name="istex:docType">
<istex:entity SYSTEM="gr1" NDATA="IMAGE" name="gr1"></istex:entity>
<istex:entity SYSTEM="gr2" NDATA="IMAGE" name="gr2"></istex:entity>
<istex:entity SYSTEM="gr3" NDATA="IMAGE" name="gr3"></istex:entity>
<istex:entity SYSTEM="gr4" NDATA="IMAGE" name="gr4"></istex:entity>
<istex:entity SYSTEM="gr5" NDATA="IMAGE" name="gr5"></istex:entity>
<istex:entity SYSTEM="gr6" NDATA="IMAGE" name="gr6"></istex:entity>
<istex:entity SYSTEM="gr7" NDATA="IMAGE" name="gr7"></istex:entity>
<istex:entity SYSTEM="gr8" NDATA="IMAGE" name="gr8"></istex:entity>
<istex:entity SYSTEM="gr9" NDATA="IMAGE" name="gr9"></istex:entity>
<istex:entity SYSTEM="gr10" NDATA="IMAGE" name="gr10"></istex:entity>
<istex:entity SYSTEM="gr11" NDATA="IMAGE" name="gr11"></istex:entity>
<istex:entity SYSTEM="gr12" NDATA="IMAGE" name="gr12"></istex:entity>
<istex:entity SYSTEM="gr13" NDATA="IMAGE" name="gr13"></istex:entity>
<istex:entity SYSTEM="gr14" NDATA="IMAGE" name="gr14"></istex:entity>
</istex:docType>
<istex:document>
<converted-article version="4.5.2" docsubtype="fla">
<item-info>
<jid>SPECOM</jid>
<aid>1106</aid>
<ce:pii>S0167-6393(00)00068-6</ce:pii>
<ce:doi>10.1016/S0167-6393(00)00068-6</ce:doi>
<ce:copyright type="full-transfer" year="2001">Elsevier Science B.V.</ce:copyright>
</item-info>
<head>
<ce:article-footnote>
<ce:label></ce:label>
<ce:note-para>Invited paper.</ce:note-para>
</ce:article-footnote>
<ce:title>A formal framework for linguistic annotation</ce:title>
<ce:author-group>
<ce:author>
<ce:given-name>Steven</ce:given-name>
<ce:surname>Bird</ce:surname>
<ce:cross-ref refid="CORR1">*</ce:cross-ref>
<ce:e-address>sb@ldc.upenn.edu</ce:e-address>
</ce:author>
<ce:author>
<ce:given-name>Mark</ce:given-name>
<ce:surname>Liberman</ce:surname>
</ce:author>
<ce:affiliation>
<ce:textfn>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</ce:textfn>
</ce:affiliation>
<ce:correspondence id="CORR1">
<ce:label>*</ce:label>
<ce:text>Corresponding author. Tel.: +1-215-898-0464; fax: +1-215-573-2175</ce:text>
</ce:correspondence>
</ce:author-group>
<ce:date-accepted day="2" month="8" year="2000"></ce:date-accepted>
<ce:abstract>
<ce:section-title>Abstract</ce:section-title>
<ce:abstract-sec>
<ce:simple-para>`Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions – audio, video and/or physiological recordings – or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.</ce:simple-para>
</ce:abstract-sec>
</ce:abstract>
<ce:abstract xml:lang="de">
<ce:section-title>Zusammenfassung</ce:section-title>
<ce:abstract-sec>
<ce:simple-para>Der Begriff `Linguistische Annotation' bezeichnet alle Arten deskriptiver oder analytischer Beschreibung von Sprachdaten. Die Ausgangsdaten können dabei entweder die Form von Zeitfunktionen haben – also z.B. Audio, Video und/oder physiologische Signale – oder als Text vorliegen. Die Annotation dagegen kann folgende Inhalte haben: alle Arten von Transkriptionen (von phonetischen Merkmalen bis zu Dialog-Strukturen), Phrasen – oder Inhalts-Segmentierung, syntaktische Analysen, Identifikation von `named entities', Querverweise innerhalb der Annotation, usw. Zwar stehen zur Zeit mehrere verschiedene Formate und Werkzeuge zur linguistischen Annotation zur Verfügung, andererseits entwickelt sich das Fehlen eines allgemein akzeptierten Standards zu einem ernsten Problem. Bisher vorgeschlagene Standards konzentrieren sich auf die Datenformate. Dieser Beitrag dagegen konzentriert sich auf die logische Struktur linguistischer Annotationen. Wir untersuchen eine breite Auswahl existierender Formate und können zeigen, daß
<ce:hsp sp="0.35"></ce:hsp>
diesen ein gemeinsames Konzept zugrundeliegt. Dieses bildet die Grundlage für einen algebraischen Formalismus zur linguistischen Annotation, während gleichzeitig die Konsistenz zu vielen alternativen Datenstrukturen und Datenformaten erhalten bleibt.</ce:simple-para>
</ce:abstract-sec>
</ce:abstract>
<ce:abstract xml:lang="fr">
<ce:section-title>Résumé</ce:section-title>
<ce:abstract-sec>
<ce:simple-para>Par `annotation linguistique' nous désignons toute notation descriptive ou analytique appliquée à des données langagières brutes. Ces données brutes peuvent être des signaux temporels – enregistrements audio, vidéo et/ou physiologiques – ou du texte. Les notations ajoutées peuvent être des transcriptions de toute nature (des traits phonétiques aux structures du discours), des catégories grammaticales ou sémantiques, une analyse syntaxique, l'identification d' `entités nommées', l'annotation de coréférences, etc. Malgré les efforts entrepris pour créer des formats et des outils adaptés à de telles annotations et pour diffuser des bases de données linguistiques annotées, le manque de standards largement acceptés devient un problème critique. Les standards proposés, lorsqu'ils existent, se concentrent sur les formats de fichiers. Cet article se concentre au contraire sur la structure logique des annotations linguistiques. Nous passons en revue une grande variété de formats d'annotations existants et en dégageons une structure conceptuelle commune, le graphe d'annotation. Ceci fournit un cadre formel pour construire des annotations linguistiques, les tenir à jour et y effectuer des requètes, tout en restant cohérent avec de nombreux autres structures de données et formats de fichiers.</ce:simple-para>
</ce:abstract-sec>
</ce:abstract>
<ce:keywords class="keyword">
<ce:section-title>Keywords</ce:section-title>
<ce:keyword>
<ce:text>Speech markup</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Speech corpus</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>General-purpose architecture</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Directed graph</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Phonological representation</ce:text>
</ce:keyword>
</ce:keywords>
</head>
</converted-article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo>
<title>A formal framework for linguistic annotation</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA">
<title>A formal framework for linguistic annotation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Steven</namePart>
<namePart type="family">Bird</namePart>
<affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</affiliation>
<affiliation>E-mail: sb@ldc.upenn.edu</affiliation>
<description>Corresponding author. Tel.: +1-215-898-0464; fax: +1-215-573-2175</description>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mark</namePart>
<namePart type="family">Liberman</namePart>
<affiliation>Linguistic Data Consortium, University of Pennsylvania, 3615 Market Street, Philadelphia, PA 19104-2608, USA</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="Full-length article"></genre>
<originInfo>
<publisher>ELSEVIER</publisher>
<dateIssued encoding="w3cdtf">2001</dateIssued>
<copyrightDate encoding="w3cdtf">2001</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<physicalDescription>
<internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract lang="en">`Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions – audio, video and/or physiological recordings – or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, `named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats.</abstract>
<abstract lang="de">Der Begriff `Linguistische Annotation' bezeichnet alle Arten deskriptiver oder analytischer Beschreibung von Sprachdaten. Die Ausgangsdaten können dabei entweder die Form von Zeitfunktionen haben – also z.B. Audio, Video und/oder physiologische Signale – oder als Text vorliegen. Die Annotation dagegen kann folgende Inhalte haben: alle Arten von Transkriptionen (von phonetischen Merkmalen bis zu Dialog-Strukturen), Phrasen – oder Inhalts-Segmentierung, syntaktische Analysen, Identifikation von `named entities', Querverweise innerhalb der Annotation, usw. Zwar stehen zur Zeit mehrere verschiedene Formate und Werkzeuge zur linguistischen Annotation zur Verfügung, andererseits entwickelt sich das Fehlen eines allgemein akzeptierten Standards zu einem ernsten Problem. Bisher vorgeschlagene Standards konzentrieren sich auf die Datenformate. Dieser Beitrag dagegen konzentriert sich auf die logische Struktur linguistischer Annotationen. Wir untersuchen eine breite Auswahl existierender Formate und können zeigen, daß diesen ein gemeinsames Konzept zugrundeliegt. Dieses bildet die Grundlage für einen algebraischen Formalismus zur linguistischen Annotation, während gleichzeitig die Konsistenz zu vielen alternativen Datenstrukturen und Datenformaten erhalten bleibt.</abstract>
<abstract lang="fr">Par `annotation linguistique' nous désignons toute notation descriptive ou analytique appliquée à des données langagières brutes. Ces données brutes peuvent être des signaux temporels – enregistrements audio, vidéo et/ou physiologiques – ou du texte. Les notations ajoutées peuvent être des transcriptions de toute nature (des traits phonétiques aux structures du discours), des catégories grammaticales ou sémantiques, une analyse syntaxique, l'identification d' `entités nommées', l'annotation de coréférences, etc. Malgré les efforts entrepris pour créer des formats et des outils adaptés à de telles annotations et pour diffuser des bases de données linguistiques annotées, le manque de standards largement acceptés devient un problème critique. Les standards proposés, lorsqu'ils existent, se concentrent sur les formats de fichiers. Cet article se concentre au contraire sur la structure logique des annotations linguistiques. Nous passons en revue une grande variété de formats d'annotations existants et en dégageons une structure conceptuelle commune, le graphe d'annotation. Ceci fournit un cadre formel pour construire des annotations linguistiques, les tenir à jour et y effectuer des requètes, tout en restant cohérent avec de nombreux autres structures de données et formats de fichiers.</abstract>
<note>Invited paper.</note>
<note type="content">Fig. 1: The two and three-level architectures for speech annotation.</note>
<note type="content">Fig. 2: TIMIT annotation data and graph structure.</note>
<note type="content">Fig. 3: BAS partitur annotation data and graph structure.</note>
<note type="content">Fig. 4: CHILDES annotation data and graph structure.</note>
<note type="content">Fig. 5: LACITO annotation data and graph structure.</note>
<note type="content">Fig. 6: LDC telephone speech data and graph structure.</note>
<note type="content">Fig. 7: UTF annotation data and graph structure.</note>
<note type="content">Fig. 8: Multiple annotations of the Switchboard corpus, with annotation graph.</note>
<note type="content">Fig. 9: Annotation graph for coreference example.</note>
<note type="content">Fig. 10: Gestural score for the phrase 'ten pin'.</note>
<note type="content">Fig. 11: Possible structures for a single layer.</note>
<note type="content">Fig. 12: Inter-arc linkages modeled using equivalence classes.</note>
<note type="content">Fig. 13: Sentence from Carmina 1.5 (Horace) showing dependency structure, with two annotation graphs.</note>
<note type="content">Fig. 14: Visualization for BU example.</note>
<subject>
<genre>Keywords</genre>
<topic>Speech markup</topic>
<topic>Speech corpus</topic>
<topic>General-purpose architecture</topic>
<topic>Directed graph</topic>
<topic>Phonological representation</topic>
</subject>
<relatedItem type="host">
<titleInfo>
<title>Speech Communication</title>
</titleInfo>
<titleInfo type="abbreviated">
<title>SPECOM</title>
</titleInfo>
<name type="conference">
<namePart>Speech Annotation and Corpus Tools</namePart>
<namePart>Speech Annotation</namePart>
</name>
<name type="personal">
<namePart>S. Bird and J. Harrington</namePart>
<role>
<roleTerm type="text">editor</roleTerm>
</role>
</name>
<genre type="journal">journal</genre>
<originInfo>
<dateIssued encoding="w3cdtf">200101</dateIssued>
</originInfo>
<identifier type="ISSN">0167-6393</identifier>
<identifier type="PII">S0167-6393(00)X0050-7</identifier>
<part>
<date>200101</date>
<detail type="issue">
<title>Speech Annotation and Corpus Tools</title>
</detail>
<detail type="volume">
<number>33</number>
<caption>vol.</caption>
</detail>
<detail type="issue">
<number>1–2</number>
<caption>no.</caption>
</detail>
<extent unit="issue pages">
<start>1</start>
<end>176</end>
</extent>
<extent unit="pages">
<start>23</start>
<end>60</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">CE38257115F73D7CD5D71EEDCCC26FBE73C26383</identifier>
<identifier type="DOI">10.1016/S0167-6393(00)00068-6</identifier>
<identifier type="PII">S0167-6393(00)00068-6</identifier>
<accessCondition type="use and reproduction" contentType="">© 2001 Elsevier Science B.V.</accessCondition>
<recordInfo>
<recordContentSource>ELSEVIER</recordContentSource>
<recordOrigin>Elsevier Science B.V., ©2001</recordOrigin>
</recordInfo>
</mods>
</metadata>
<enrichments>
<istex:catWosTEI uri="https://api.istex.fr/document/CE38257115F73D7CD5D71EEDCCC26FBE73C26383/enrichments/catWos">
<teiHeader>
<profileDesc>
<textClass>
<classCode scheme="WOS">COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS</classCode>
<classCode scheme="WOS">ACOUSTICS</classCode>
</textClass>
</profileDesc>
</teiHeader>
</istex:catWosTEI>
</enrichments>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000356 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000356 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Ticri
   |area=    TeiVM2
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:CE38257115F73D7CD5D71EEDCCC26FBE73C26383
   |texte=   A formal framework for linguistic annotation
}}

Wicri

This area was generated with Dilib version V0.6.31.
Data generation: Mon Oct 30 21:59:18 2017. Site generation: Sun Feb 11 23:16:06 2024