Cyberinfrastructure exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages

Internal identifier: 000469 (Istex/Corpus); previous: 000468; next: 000470

Authors: William D. Lewis; Fei Xia

Source:

RBID : ISTEX:A068F97DB382C1195C173F6FD8944FBDEC4E9409

Abstract

In this article, we review the process of building ODIN, the Online Database of Interlinear Text (http://odin.linguistlist.org) a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted on the web. At the time of this writing, ODIN holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from >10% of the world's languages). ODIN's charter has been to make these data available to linguists and other language researchers via search, providing the facility to find instances of language data and related resources (i.e. the documents from which data were extracted) by language name, language family, and even annotations used to markup the data (e.g. NOM, ACC, ERG, PST, 3SG). Further, we have sought to enrich the data we have collected and extract ‘knowledge’ from the enriched content. To enrich the data, we use a variety of statistical tagging and parsing methods applied in the English translations. An enhanced search facility allows users to find data across languages for a variety of syntactic constructions and constituent orders, facilitating unprecedented automated and online discovery of language data.

Url:
DOI: 10.1093/llc/fqq006

Links to Exploration step

ISTEX:A068F97DB382C1195C173F6FD8944FBDEC4E9409

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</title>
<author wicri:is="90%">
<name sortKey="Lewis, William D" sort="Lewis, William D" uniqKey="Lewis W" first="William D." last="Lewis">William D. Lewis</name>
<affiliation>
<mods:affiliation>Microsoft Research, USA</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: wilewis@microsoft.com</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Xia, Fei" sort="Xia, Fei" uniqKey="Xia F" first="Fei" last="Xia">Fei Xia</name>
<affiliation>
<mods:affiliation>Department of Linguistics, University of Washington, USA</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:A068F97DB382C1195C173F6FD8944FBDEC4E9409</idno>
<date when="2010" year="2010">2010</date>
<idno type="doi">10.1093/llc/fqq006</idno>
<idno type="url">https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000469</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</title>
<author wicri:is="90%">
<name sortKey="Lewis, William D" sort="Lewis, William D" uniqKey="Lewis W" first="William D." last="Lewis">William D. Lewis</name>
<affiliation>
<mods:affiliation>Microsoft Research, USA</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: wilewis@microsoft.com</mods:affiliation>
</affiliation>
</author>
<author wicri:is="90%">
<name sortKey="Xia, Fei" sort="Xia, Fei" uniqKey="Xia F" first="Fei" last="Xia">Fei Xia</name>
<affiliation>
<mods:affiliation>Department of Linguistics, University of Washington, USA</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Literary and Linguistic Computing</title>
<idno type="ISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2010-09">2010-09</date>
<biblScope unit="volume">25</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="303">303</biblScope>
<biblScope unit="page" to="319">319</biblScope>
</imprint>
<idno type="ISSN">0268-1145</idno>
</series>
<idno type="istex">A068F97DB382C1195C173F6FD8944FBDEC4E9409</idno>
<idno type="DOI">10.1093/llc/fqq006</idno>
<idno type="ArticleID">fqq006</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract">In this article, we review the process of building ODIN, the Online Database of Interlinear Text (http://odin.linguistlist.org) a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted on the web. At the time of this writing, ODIN holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from >10% of the world's languages). ODIN's charter has been to make these data available to linguists and other language researchers via search, providing the facility to find instances of language data and related resources (i.e. the documents from which data were extracted) by language name, language family, and even annotations used to markup the data (e.g. NOM, ACC, ERG, PST, 3SG). Further, we have sought to enrich the data we have collected and extract ‘knowledge’ from the enriched content. To enrich the data, we use a variety of statistical tagging and parsing methods applied in the English translations. An enhanced search facility allows users to find data across languages for a variety of syntactic constructions and constituent orders, facilitating unprecedented automated and online discovery of language data.</div>
</front>
</TEI>
<istex>
<corpusName>oup</corpusName>
<author>
<json:item>
<name>William D. Lewis</name>
<affiliations>
<json:string>Microsoft Research, USA</json:string>
<json:string>E-mail: wilewis@microsoft.com</json:string>
</affiliations>
</json:item>
<json:item>
<name>Fei Xia</name>
<affiliations>
<json:string>Department of Linguistics, University of Washington, USA</json:string>
</affiliations>
</json:item>
</author>
<subject>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Original Articles</value>
</json:item>
</subject>
<articleId>
<json:string>fqq006</json:string>
</articleId>
<language>
<json:string>eng</json:string>
</language>
<originalGenre>
<json:string>research-article</json:string>
</originalGenre>
<abstract>In this article, we review the process of building ODIN, the Online Database of Interlinear Text (http://odin.linguistlist.org) a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted on the web. At the time of this writing, ODIN holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from >10% of the world's languages). ODIN's charter has been to make these data available to linguists and other language researchers via search, providing the facility to find instances of language data and related resources (i.e. the documents from which data were extracted) by language name, language family, and even annotations used to markup the data (e.g. NOM, ACC, ERG, PST, 3SG). Further, we have sought to enrich the data we have collected and extract ‘knowledge’ from the enriched content. To enrich the data, we use a variety of statistical tagging and parsing methods applied in the English translations. An enhanced search facility allows users to find data across languages for a variety of syntactic constructions and constituent orders, facilitating unprecedented automated and online discovery of language data.</abstract>
<qualityIndicators>
<score>7.852</score>
<pdfVersion>1.4</pdfVersion>
<pdfPageSize>538.583 x 697.323 pts</pdfPageSize>
<refBibsNative>true</refBibsNative>
<keywordCount>1</keywordCount>
<abstractCharCount>1309</abstractCharCount>
<pdfWordCount>7881</pdfWordCount>
<pdfCharCount>50627</pdfCharCount>
<pdfPageCount>17</pdfPageCount>
<abstractWordCount>196</abstractWordCount>
</qualityIndicators>
<title>Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</title>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<volume>25</volume>
<publisherId>
<json:string>litlin</json:string>
</publisherId>
<pages>
<last>319</last>
<first>303</first>
</pages>
<issn>
<json:string>0268-1145</json:string>
</issn>
<issue>3</issue>
<genre>
<json:string>journal</json:string>
</genre>
<language>
<json:string>unknown</json:string>
</language>
<eissn>
<json:string>1477-4615</json:string>
</eissn>
<title>Literary and Linguistic Computing</title>
</host>
<categories>
<wos>
<json:string>LINGUISTICS</json:string>
<json:string>LITERATURE</json:string>
</wos>
</categories>
<publicationDate>2010</publicationDate>
<copyrightDate>2010</copyrightDate>
<doi>
<json:string>10.1093/llc/fqq006</json:string>
</doi>
<id>A068F97DB382C1195C173F6FD8944FBDEC4E9409</id>
<score>0.15717497</score>
<fulltext>
<json:item>
<original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/fulltext/pdf</uri>
</json:item>
<json:item>
<original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/fulltext/tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a">Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher>Oxford University Press</publisher>
<availability>
<p>OUP</p>
</availability>
<date>2010-05-07</date>
</publicationStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a">Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</title>
<author>
<persName>
<forename type="first">William D.</forename>
<surname>Lewis</surname>
</persName>
<email>wilewis@microsoft.com</email>
<affiliation>Microsoft Research, USA</affiliation>
</author>
<author>
<persName>
<forename type="first">Fei</forename>
<surname>Xia</surname>
</persName>
<affiliation>Department of Linguistics, University of Washington, USA</affiliation>
</author>
</analytic>
<monogr>
<title level="j">Literary and Linguistic Computing</title>
<idno type="pISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2010-09"></date>
<biblScope unit="volume">25</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="303">303</biblScope>
<biblScope unit="page" to="319">319</biblScope>
</imprint>
</monogr>
<idno type="istex">A068F97DB382C1195C173F6FD8944FBDEC4E9409</idno>
<idno type="DOI">10.1093/llc/fqq006</idno>
<idno type="ArticleID">fqq006</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2010-05-07</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract>
<p>In this article, we review the process of building ODIN, the Online Database of Interlinear Text (http://odin.linguistlist.org) a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted on the web. At the time of this writing, ODIN holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from >10% of the world's languages). ODIN's charter has been to make these data available to linguists and other language researchers via search, providing the facility to find instances of language data and related resources (i.e. the documents from which data were extracted) by language name, language family, and even annotations used to markup the data (e.g. NOM, ACC, ERG, PST, 3SG). Further, we have sought to enrich the data we have collected and extract ‘knowledge’ from the enriched content. To enrich the data, we use a variety of statistical tagging and parsing methods applied in the English translations. An enhanced search facility allows users to find data across languages for a variety of syntactic constructions and constituent orders, facilitating unprecedented automated and online discovery of language data.</p>
</abstract>
<textClass>
<keywords scheme="keyword">
<list>
<item>
<term>Original Articles</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change when="2010-05-07">Created</change>
<change when="2010-09">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<original>false</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="corpus oup" wicri:toSee="no header">
<istex:xmlDeclaration>version="1.0" encoding="utf-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" URI="journalpublishing.dtd" name="istex:docType"></istex:docType>
<istex:document>
<article article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">litlin</journal-id>
<journal-id journal-id-type="hwp">litlin</journal-id>
<journal-title>Literary and Linguistic Computing</journal-title>
<issn pub-type="ppub">0268-1145</issn>
<issn pub-type="epub">1477-4615</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.1093/llc/fqq006</article-id>
<article-id pub-id-type="publisher-id">fqq006</article-id>
<article-categories>
<subj-group>
<subject>Original Articles</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Lewis</surname>
<given-names>William D.</given-names>
</name>
</contrib>
<aff>Microsoft Research, USA</aff>
</contrib-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Xia</surname>
<given-names>Fei</given-names>
</name>
</contrib>
<aff>Department of Linguistics, University of Washington, USA</aff>
</contrib-group>
<author-notes>
<corresp>
<bold>Correspondence:</bold>
William D. Lewis, Microsoft Research, One Microsoft Way, 99\1637, Redmond, WA 98052, USA
<bold>E-mail:</bold>
<email>wilewis@microsoft.com</email>
.</corresp>
</author-notes>
<pub-date pub-type="ppub">
<month>9</month>
<year>2010</year>
</pub-date>
<pub-date pub-type="epub">
<day>7</day>
<month>5</month>
<year>2010</year>
</pub-date>
<volume>25</volume>
<issue>3</issue>
<issue-title>Journal of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities</issue-title>
<fpage>303</fpage>
<lpage>319</lpage>
<permissions>
<copyright-statement>© The Author 2010. Published by Oxford University Press on behalf of ALLC and ACH. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org</copyright-statement>
<copyright-year>2010</copyright-year>
</permissions>
<abstract>
<p>In this article, we review the process of building ODIN, the Online Database of Interlinear Text (
<ext-link ext-link-type="uri" xlink:href="http://odin.linguistlist.org">http://odin.linguistlist.org</ext-link>
) a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted on the web. At the time of this writing, ODIN holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from >10% of the world's languages). ODIN's charter has been to make these data available to linguists and other language researchers via search, providing the facility to find instances of language data and related resources (i.e. the documents from which data were extracted) by language name, language family, and even annotations used to markup the data (e.g. NOM, ACC, ERG, PST, 3SG). Further, we have sought to enrich the data we have collected and extract ‘knowledge’ from the enriched content. To enrich the data, we use a variety of statistical tagging and parsing methods applied in the English translations. An enhanced search facility allows users to find data across languages for a variety of syntactic constructions and constituent orders, facilitating unprecedented automated and online discovery of language data.</p>
</abstract>
</article-meta>
</front>
</article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo>
<title>Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA">
<title>Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages</title>
</titleInfo>
<name type="personal">
<namePart type="given">William D.</namePart>
<namePart type="family">Lewis</namePart>
<affiliation>Microsoft Research, USA</affiliation>
<affiliation>E-mail: wilewis@microsoft.com</affiliation>
</name>
<name type="personal">
<namePart type="given">Fei</namePart>
<namePart type="family">Xia</namePart>
<affiliation>Department of Linguistics, University of Washington, USA</affiliation>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="research-article"></genre>
<subject>
<topic>Original Articles</topic>
</subject>
<originInfo>
<publisher>Oxford University Press</publisher>
<dateIssued encoding="w3cdtf">2010-09</dateIssued>
<dateCreated encoding="w3cdtf">2010-05-07</dateCreated>
<copyrightDate encoding="w3cdtf">2010</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<physicalDescription>
<internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract>In this article, we review the process of building ODIN, the Online Database of Interlinear Text (http://odin.linguistlist.org) a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted on the web. At the time of this writing, ODIN holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from >10% of the world's languages). ODIN's charter has been to make these data available to linguists and other language researchers via search, providing the facility to find instances of language data and related resources (i.e. the documents from which data were extracted) by language name, language family, and even annotations used to markup the data (e.g. NOM, ACC, ERG, PST, 3SG). Further, we have sought to enrich the data we have collected and extract ‘knowledge’ from the enriched content. To enrich the data, we use a variety of statistical tagging and parsing methods applied in the English translations. An enhanced search facility allows users to find data across languages for a variety of syntactic constructions and constituent orders, facilitating unprecedented automated and online discovery of language data.</abstract>
<relatedItem type="host">
<titleInfo>
<title>Literary and Linguistic Computing</title>
</titleInfo>
<genre type="journal">journal</genre>
<identifier type="ISSN">0268-1145</identifier>
<identifier type="eISSN">1477-4615</identifier>
<identifier type="PublisherID">litlin</identifier>
<identifier type="PublisherID-hwp">litlin</identifier>
<part>
<date>2010</date>
<detail type="title">
<title>Journal of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities</title>
</detail>
<detail type="volume">
<caption>vol.</caption>
<number>25</number>
</detail>
<detail type="issue">
<caption>no.</caption>
<number>3</number>
</detail>
<extent unit="pages">
<start>303</start>
<end>319</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">A068F97DB382C1195C173F6FD8944FBDEC4E9409</identifier>
<identifier type="DOI">10.1093/llc/fqq006</identifier>
<identifier type="ArticleID">fqq006</identifier>
<accessCondition type="use and reproduction" contentType="copyright">The Author 2010. Published by Oxford University Press on behalf of ALLC and ACH. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org</accessCondition>
<recordInfo>
<recordContentSource>OUP</recordContentSource>
</recordInfo>
</mods>
</metadata>
<covers>
<json:item>
<original>true</original>
<mimetype>image/tiff</mimetype>
<extension>tiff</extension>
<uri>https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/covers/tiff</uri>
</json:item>
</covers>
<annexes>
<json:item>
<original>true</original>
<mimetype>image/jpeg</mimetype>
<extension>jpeg</extension>
<uri>https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/annexes/jpeg</uri>
</json:item>
<json:item>
<original>true</original>
<mimetype>image/gif</mimetype>
<extension>gif</extension>
<uri>https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/annexes/gif</uri>
</json:item>
<json:item>
<original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/annexes/pdf</uri>
</json:item>
</annexes>
<enrichments>
<istex:catWosTEI uri="https://api.istex.fr/document/A068F97DB382C1195C173F6FD8944FBDEC4E9409/enrichments/catWos">
<teiHeader>
<profileDesc>
<textClass>
<classCode scheme="WOS">LINGUISTICS</classCode>
<classCode scheme="WOS">LITERATURE</classCode>
</textClass>
</profileDesc>
</teiHeader>
</istex:catWosTEI>
</enrichments>
<serie></serie>
</istex>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/CyberinfraV1/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000469 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000469 | SxmlIndent | more
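
Once the record has been selected, its fields can be read with a general-purpose XML parser. The sketch below is only an illustration and is not part of Dilib: it assumes the command output is the <record> element as shown on this page, with no XML declaration. Because the record uses prefixes such as wicri:, mods:, json: and istex: without declaring them, a strict parser would reject it, so the sketch wraps the record in an element that binds each prefix to a placeholder URI. The function names (parse_record, summarize), the placeholder URIs, and the abridged SAMPLE record are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Prefixes that appear in the record without a namespace declaration.
# The URIs bound to them below are placeholders (an assumption of this
# sketch); any URI works as long as every prefix is bound to something.
PREFIXES = ("wicri", "mods", "json", "istex", "xlink")


def parse_record(record_xml: str) -> ET.Element:
    """Parse a <record> string whose namespace prefixes are undeclared."""
    decls = " ".join(f'xmlns:{p}="urn:example:{p}"' for p in PREFIXES)
    wrapped = f"<wrapper {decls}>{record_xml}</wrapper>"
    # The declarations on <wrapper> are in scope for all descendants.
    return ET.fromstring(wrapped).find("record")


def summarize(record: ET.Element) -> dict:
    """Return the first title and DOI found in the TEI header."""
    return {
        "title": record.findtext(".//titleStmt/title"),
        "doi": record.findtext(".//idno[@type='doi']"),
    }


# Abridged stand-in for the record above (title shortened for the example).
SAMPLE = """<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader><fileDesc>
<titleStmt><title>Developing ODIN</title></titleStmt>
<publicationStmt><idno type="doi">10.1093/llc/fqq006</idno></publicationStmt>
</fileDesc></teiHeader>
</TEI>
</record>"""

info = summarize(parse_record(SAMPLE))
print(info["title"], info["doi"])
```

In practice the string passed to parse_record would come from the HfdSelect output rather than from SAMPLE, for instance by piping the command into a script that reads standard input.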

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    CyberinfraV1
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:A068F97DB382C1195C173F6FD8944FBDEC4E9409
   |texte=   Developing ODIN: A Multilingual Repository of Annotated Language Data for Hundreds of the World's Languages
}}

Wicri

This area was generated with Dilib version V0.6.25.
Data generation: Thu Oct 27 09:30:58 2016. Site generation: Sun Mar 10 23:08:40 2024