TEI Exploration Server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The Scottish Corpus of Texts and Speech: Problems of Corpus Design

Internal identifier: 000464 (Istex/Corpus); previous: 000463; next: 000465

Author: Fiona M. Douglas

Source:

RBID : ISTEX:8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E

Abstract

In recent years, the use of large corpora has revolutionized the way we study language. There are now numerous well‐established corpus projects, which have set the standard for future corpus‐based research. As more and more corpora are developed and technology continues to offer greater and greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design. Corpus design can, however, present peculiar problems for particular types of source material. The Scottish Corpus of Texts and Speech (SCOTS) is the first large‐scale corpus project specifically dedicated to the languages of Scotland, and therefore it faces many unanswered questions, which will have a direct impact on the corpus design. The first phase of the project will focus on the language varieties Scots and Scottish English, varieties that are themselves notoriously difficult to define. This paper outlines the complexities of the Scottish linguistic situation, before going on to examine the problematic issue of how to construct a well‐balanced and representative corpus in what is largely uncharted territory. It argues that a well‐formed corpus cannot be constructed in a linguistic vacuum, and that familiarity with the overall language population is essential before effective corpus sampling techniques, methodologies, and categorization schema can be devised. It also offers some preliminary methodologies that will be adopted by SCOTS.

Url:
DOI: 10.1093/llc/18.1.23

Links to Exploration step

ISTEX:8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">The Scottish Corpus of Texts and Speech: Problems of Corpus Design</title>
<author wicri:is="90%">
<name sortKey="Douglas, Fiona M" sort="Douglas, Fiona M" uniqKey="Douglas F" first="Fiona M." last="Douglas">Fiona M. Douglas</name>
<affiliation>
<mods:affiliation>University of Glasgow, Glasgow, UK</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E</idno>
<date when="2003" year="2003">2003</date>
<idno type="doi">10.1093/llc/18.1.23</idno>
<idno type="url">https://api.istex.fr/document/8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000464</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">The Scottish Corpus of Texts and Speech: Problems of Corpus Design</title>
<author wicri:is="90%">
<name sortKey="Douglas, Fiona M" sort="Douglas, Fiona M" uniqKey="Douglas F" first="Fiona M." last="Douglas">Fiona M. Douglas</name>
<affiliation>
<mods:affiliation>University of Glasgow, Glasgow, UK</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Literary and Linguistic Computing</title>
<title level="j" type="abbrev">Lit Linguist Computing</title>
<idno type="ISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2003-04">2003-04</date>
<biblScope unit="volume">18</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="37">37</biblScope>
</imprint>
<idno type="ISSN">0268-1145</idno>
</series>
<idno type="istex">8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E</idno>
<idno type="DOI">10.1093/llc/18.1.23</idno>
<idno type="local">180023</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In recent years, the use of large corpora has revolutionized the way we study language. There are now numerous well‐established corpus projects, which have set the standard for future corpus‐based research. As more and more corpora are developed and technology continues to offer greater and greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design. Corpus design can, however, present peculiar problems for particular types of source material. The Scottish Corpus of Texts and Speech (SCOTS) is the first large‐scale corpus project specifically dedicated to the languages of Scotland, and therefore it faces many unanswered questions, which will have a direct impact on the corpus design. The first phase of the project will focus on the language varieties Scots and Scottish English, varieties that are themselves notoriously difficult to define. This paper outlines the complexities of the Scottish linguistic situation, before going on to examine the problematic issue of how to construct a well‐balanced and representative corpus in what is largely uncharted territory. It argues that a well‐formed corpus cannot be constructed in a linguistic vacuum, and that familiarity with the overall language population is essential before effective corpus sampling techniques, methodologies, and categorization schema can be devised. It also offers some preliminary methodologies that will be adopted by SCOTS.</div>
</front>
</TEI>
<istex>
<corpusName>oup</corpusName>
<author>
<json:item>
<name>Fiona M. Douglas</name>
<affiliations>
<json:string>University of Glasgow, Glasgow, UK</json:string>
</affiliations>
</json:item>
</author>
<language>
<json:string>eng</json:string>
</language>
<originalGenre>
<json:string>research-article</json:string>
</originalGenre>
<abstract>In recent years, the use of large corpora has revolutionized the way we study language. There are now numerous well‐established corpus projects, which have set the standard for future corpus‐based research. As more and more corpora are developed and technology continues to offer greater and greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design. Corpus design can, however, present peculiar problems for particular types of source material. The Scottish Corpus of Texts and Speech (SCOTS) is the first large‐scale corpus project specifically dedicated to the languages of Scotland, and therefore it faces many unanswered questions, which will have a direct impact on the corpus design. The first phase of the project will focus on the language varieties Scots and Scottish English, varieties that are themselves notoriously difficult to define. This paper outlines the complexities of the Scottish linguistic situation, before going on to examine the problematic issue of how to construct a well‐balanced and representative corpus in what is largely uncharted territory. It argues that a well‐formed corpus cannot be constructed in a linguistic vacuum, and that familiarity with the overall language population is essential before effective corpus sampling techniques, methodologies, and categorization schema can be devised. It also offers some preliminary methodologies that will be adopted by SCOTS.</abstract>
<qualityIndicators>
<score>7.76</score>
<pdfVersion>1.3</pdfVersion>
<pdfPageSize>538.307 x 697.433 pts</pdfPageSize>
<refBibsNative>false</refBibsNative>
<keywordCount>0</keywordCount>
<abstractCharCount>1542</abstractCharCount>
<pdfWordCount>6293</pdfWordCount>
<pdfCharCount>38200</pdfCharCount>
<pdfPageCount>16</pdfPageCount>
<abstractWordCount>230</abstractWordCount>
</qualityIndicators>
<title>The Scottish Corpus of Texts and Speech: Problems of Corpus Design</title>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<volume>18</volume>
<publisherId>
<json:string>litlin</json:string>
</publisherId>
<pages>
<last>37</last>
<first>23</first>
</pages>
<issn>
<json:string>0268-1145</json:string>
</issn>
<issue>1</issue>
<genre>
<json:string>journal</json:string>
</genre>
<language>
<json:string>unknown</json:string>
</language>
<eissn>
<json:string>1477-4615</json:string>
</eissn>
<title>Literary and Linguistic Computing</title>
</host>
<categories>
<wos>
<json:string>LINGUISTICS</json:string>
<json:string>LITERATURE</json:string>
</wos>
</categories>
<publicationDate>2003</publicationDate>
<copyrightDate>2003</copyrightDate>
<doi>
<json:string>10.1093/llc/18.1.23</json:string>
</doi>
<id>8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E</id>
<score>0.11841005</score>
<fulltext>
<json:item>
<original>true</original>
<mimetype>application/pdf</mimetype>
<extension>pdf</extension>
<uri>https://api.istex.fr/document/8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E/fulltext/pdf</uri>
</json:item>
<json:item>
<original>false</original>
<mimetype>application/zip</mimetype>
<extension>zip</extension>
<uri>https://api.istex.fr/document/8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E/fulltext/zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/document/8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E/fulltext/tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a" type="main" xml:lang="en">The Scottish Corpus of Texts and Speech: Problems of Corpus Design</title>
<respStmt xml:id="ISTEX-API" resp="Références bibliographiques récupérées via GROBID" name="ISTEX-API (INIST-CNRS)"></respStmt>
<respStmt>
<resp>Références bibliographiques récupérées via GROBID</resp>
<name resp="ISTEX-API">ISTEX-API (INIST-CNRS)</name>
</respStmt>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher>Oxford University Press</publisher>
<availability>
<p>OUP</p>
</availability>
<date>2003</date>
</publicationStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a" type="main" xml:lang="en">The Scottish Corpus of Texts and Speech: Problems of Corpus Design</title>
<author>
<persName>
<forename type="first">Fiona M.</forename>
<surname>Douglas</surname>
</persName>
<affiliation>University of Glasgow, Glasgow, UK</affiliation>
</author>
</analytic>
<monogr>
<title level="j">Literary and Linguistic Computing</title>
<title level="j" type="abbrev">Lit Linguist Computing</title>
<idno type="pISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint>
<publisher>Oxford University Press</publisher>
<date type="published" when="2003-04"></date>
<biblScope unit="volume">18</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="37">37</biblScope>
</imprint>
</monogr>
<idno type="istex">8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E</idno>
<idno type="DOI">10.1093/llc/18.1.23</idno>
<idno type="local">180023</idno>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2003</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract xml:lang="en">
<p>In recent years, the use of large corpora has revolutionized the way we study language. There are now numerous well‐established corpus projects, which have set the standard for future corpus‐based research. As more and more corpora are developed and technology continues to offer greater and greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design. Corpus design can, however, present peculiar problems for particular types of source material. The Scottish Corpus of Texts and Speech (SCOTS) is the first large‐scale corpus project specifically dedicated to the languages of Scotland, and therefore it faces many unanswered questions, which will have a direct impact on the corpus design. The first phase of the project will focus on the language varieties Scots and Scottish English, varieties that are themselves notoriously difficult to define. This paper outlines the complexities of the Scottish linguistic situation, before going on to examine the problematic issue of how to construct a well‐balanced and representative corpus in what is largely uncharted territory. It argues that a well‐formed corpus cannot be constructed in a linguistic vacuum, and that familiarity with the overall language population is essential before effective corpus sampling techniques, methodologies, and categorization schema can be devised. It also offers some preliminary methodologies that will be adopted by SCOTS.</p>
</abstract>
</profileDesc>
<revisionDesc>
<change when="2003-04">Published</change>
<change xml:id="refBibs-istex" who="#ISTEX-API" when="2016-03-15">References added</change>
<change xml:id="refBibs-istex" who="#ISTEX-API" when="2016-03-21">References added</change>
<change xml:id="refBibs-istex" who="#ISTEX-API" when="2016-07-27">References added</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<original>false</original>
<mimetype>text/plain</mimetype>
<extension>txt</extension>
<uri>https://api.istex.fr/document/8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E/fulltext/txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="corpus oup" wicri:toSee="no header">
<istex:xmlDeclaration>version="1.0" encoding="US-ASCII"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" URI="journalpublishing.dtd" name="istex:docType"></istex:docType>
<istex:document>
<article xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">litlin</journal-id>
<journal-id journal-id-type="hwp">litlin</journal-id>
<journal-title>Literary and Linguistic Computing</journal-title>
<abbrev-journal-title abbrev-type="publisher">Lit Linguist Computing</abbrev-journal-title>
<issn pub-type="ppub">0268-1145</issn>
<issn pub-type="epub">1477-4615</issn>
<publisher>
<publisher-name>Oxford University Press</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="other">180023</article-id>
<article-id pub-id-type="doi">10.1093/llc/18.1.23</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>The Scottish Corpus of Texts and Speech: Problems of Corpus Design</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Douglas</surname>
<given-names>Fiona M.</given-names>
</name>
<xref rid="AFF1">1</xref>
</contrib>
<aff>
<target target-type="aff" id="AFF1"></target>
<label>1</label>
University of Glasgow, Glasgow, UK</aff>
</contrib-group>
<pub-date pub-type="ppub">
<month>04</month>
<year>2003</year>
</pub-date>
<volume>18</volume>
<issue>1</issue>
<fpage>23</fpage>
<lpage>37</lpage>
<permissions>
<copyright-statement>Copyright Association for Literary &amp; Linguistic Computing 2003</copyright-statement>
<copyright-year>2003</copyright-year>
</permissions>
<abstract xml:lang="en">
<p>In recent years, the use of large corpora has revolutionized the way we study language. There are now numerous well‐established corpus projects, which have set the standard for future corpus‐based research. As more and more corpora are developed and technology continues to offer greater and greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design. Corpus design can, however, present peculiar problems for particular types of source material. The Scottish Corpus of Texts and Speech (SCOTS) is the first large‐scale corpus project specifically dedicated to the languages of Scotland, and therefore it faces many unanswered questions, which will have a direct impact on the corpus design. The first phase of the project will focus on the language varieties <italic>Scots</italic> and <italic>Scottish English</italic>, varieties that are themselves notoriously difficult to define. This paper outlines the complexities of the Scottish linguistic situation, before going on to examine the problematic issue of how to construct a well‐balanced and representative corpus in what is largely uncharted territory. It argues that a well‐formed corpus cannot be constructed in a linguistic vacuum, and that familiarity with the overall language population is essential before effective corpus sampling techniques, methodologies, and categorization schema can be devised. It also offers some preliminary methodologies that will be adopted by SCOTS.</p>
</abstract>
<custom-meta-wrap>
<custom-meta>
<meta-name>hwp-legacy-fpage</meta-name>
<meta-value>23</meta-value>
</custom-meta>
<custom-meta>
<meta-name>hwp-legacy-dochead</meta-name>
<meta-value>Article</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
</article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo lang="en">
<title>The Scottish Corpus of Texts and Speech: Problems of Corpus Design</title>
</titleInfo>
<titleInfo type="alternative" lang="en" contentType="CDATA">
<title>The Scottish Corpus of Texts and Speech: Problems of Corpus Design</title>
</titleInfo>
<name type="personal">
<namePart type="given">Fiona M.</namePart>
<namePart type="family">Douglas</namePart>
<affiliation>University of Glasgow, Glasgow, UK</affiliation>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="research-article"></genre>
<originInfo>
<publisher>Oxford University Press</publisher>
<dateIssued encoding="w3cdtf">2003-04</dateIssued>
<copyrightDate encoding="w3cdtf">2003</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<physicalDescription>
<internetMediaType>text/html</internetMediaType>
</physicalDescription>
<abstract lang="en">In recent years, the use of large corpora has revolutionized the way we study language. There are now numerous well‐established corpus projects, which have set the standard for future corpus‐based research. As more and more corpora are developed and technology continues to offer greater and greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design. Corpus design can, however, present peculiar problems for particular types of source material. The Scottish Corpus of Texts and Speech (SCOTS) is the first large‐scale corpus project specifically dedicated to the languages of Scotland, and therefore it faces many unanswered questions, which will have a direct impact on the corpus design. The first phase of the project will focus on the language varieties Scots and Scottish English, varieties that are themselves notoriously difficult to define. This paper outlines the complexities of the Scottish linguistic situation, before going on to examine the problematic issue of how to construct a well‐balanced and representative corpus in what is largely uncharted territory. It argues that a well‐formed corpus cannot be constructed in a linguistic vacuum, and that familiarity with the overall language population is essential before effective corpus sampling techniques, methodologies, and categorization schema can be devised. It also offers some preliminary methodologies that will be adopted by SCOTS.</abstract>
<relatedItem type="host">
<titleInfo>
<title>Literary and Linguistic Computing</title>
</titleInfo>
<titleInfo type="abbreviated">
<title>Lit Linguist Computing</title>
</titleInfo>
<genre type="journal">journal</genre>
<identifier type="ISSN">0268-1145</identifier>
<identifier type="eISSN">1477-4615</identifier>
<identifier type="PublisherID">litlin</identifier>
<identifier type="PublisherID-hwp">litlin</identifier>
<part>
<date>2003</date>
<detail type="volume">
<caption>vol.</caption>
<number>18</number>
</detail>
<detail type="issue">
<caption>no.</caption>
<number>1</number>
</detail>
<extent unit="pages">
<start>23</start>
<end>37</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E</identifier>
<identifier type="DOI">10.1093/llc/18.1.23</identifier>
<identifier type="local">180023</identifier>
<accessCondition type="use and reproduction" contentType="copyright">Copyright Association for Literary &amp; Linguistic Computing 2003</accessCondition>
<recordInfo>
<recordContentSource>OUP</recordContentSource>
</recordInfo>
</mods>
</metadata>
<enrichments>
<istex:catWosTEI uri="https://api.istex.fr/document/8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E/enrichments/catWos">
<teiHeader>
<profileDesc>
<textClass>
<classCode scheme="WOS">LINGUISTICS</classCode>
<classCode scheme="WOS">LITERATURE</classCode>
</textClass>
</profileDesc>
</teiHeader>
</istex:catWosTEI>
<json:item>
<type>refBibs</type>
<uri>https://api.istex.fr/document/8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E/enrichments/refBibs</uri>
</json:item>
</enrichments>
<serie></serie>
</istex>
</record>
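
The citation details in the record above (journal, volume, issue, page range, date) sit in the `<series>` and `<imprint>` elements of the TEI `<biblStruct>`. As a minimal sketch of how they might be read programmatically, the following uses Python's standard `xml.etree.ElementTree` on a short excerpt copied from the record. The function name `citation_fields` and the constant `EXCERPT` are illustrative assumptions, not part of Dilib or ISTEX. The excerpt deliberately omits elements with prefixed names or attributes (`wicri:`, `mods:`, `json:`, `istex:`), because the record as displayed does not declare those namespaces and a strict parser would reject them.

```python
import xml.etree.ElementTree as ET

# Excerpt of the <series> element from the record above (no prefixed names).
EXCERPT = """
<series>
  <title level="j">Literary and Linguistic Computing</title>
  <title level="j" type="abbrev">Lit Linguist Computing</title>
  <idno type="ISSN">0268-1145</idno>
  <idno type="eISSN">1477-4615</idno>
  <imprint>
    <publisher>Oxford University Press</publisher>
    <date type="published" when="2003-04">2003-04</date>
    <biblScope unit="volume">18</biblScope>
    <biblScope unit="issue">1</biblScope>
    <biblScope unit="page" from="23">23</biblScope>
    <biblScope unit="page" to="37">37</biblScope>
  </imprint>
</series>
"""


def citation_fields(xml_text):
    """Collect journal, volume, issue, pages and date from a <series> element."""
    root = ET.fromstring(xml_text)
    imprint = root.find("imprint")
    fields = {}

    # The full journal title is the <title> without a type="abbrev" attribute.
    for title in root.findall("title"):
        if title.get("type") is None:
            fields["journal"] = title.text

    # Page bounds are stored in the from/to attributes of two <biblScope>
    # elements; volume and issue are stored as element text.
    for scope in imprint.findall("biblScope"):
        unit = scope.get("unit")
        if unit == "page":
            if scope.get("from") is not None:
                fields["first_page"] = scope.get("from")
            if scope.get("to") is not None:
                fields["last_page"] = scope.get("to")
        else:
            fields[unit] = scope.text

    fields["date"] = imprint.find("date").get("when")
    return fields


if __name__ == "__main__":
    print(citation_fields(EXCERPT))
```

To process the full record in the same way, the undeclared prefixes would first have to be bound to namespace URIs on the root element; which URIs to use is not stated on this page.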

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000464 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000464 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Ticri
   |area=    TeiVM2
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E
   |texte=   The Scottish Corpus of Texts and Speech: Problems of Corpus Design
}}
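
When many records need such a link, the template call above can be generated rather than typed by hand. The following is a minimal sketch in Python. The function name `explor_link` and its argument names are hypothetical. Only the template name `Explor lien`, its parameter keys, and the column layout are taken from the example above.

```python
def explor_link(wiki, area, flux, etape, type_, cle, texte):
    """Render an {{Explor lien}} template call in the layout shown above."""
    # Parameter keys as they appear in the template; the values are supplied
    # by the caller.
    params = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", etape),
        ("type", type_),
        ("clé", cle),
        ("texte", texte),
    ]
    lines = ["{{Explor lien"]
    for name, value in params:
        label = f"|{name}="
        # Three spaces of indent, then the label padded to 10 characters so
        # that all values start in the same column.
        lines.append(f"   {label:<10}{value}")
    lines.append("}}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(explor_link(
        wiki="Wicri/Ticri",
        area="TeiVM2",
        flux="Istex",
        etape="Corpus",
        type_="RBID",
        cle="ISTEX:8A136A881E2F17F4A79E29FDB1EFC0B4490DF24E",
        texte="The Scottish Corpus of Texts and Speech: Problems of Corpus Design",
    ))
```

The column alignment is cosmetic and matches the example only for readability.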

Wicri

This area was generated with Dilib version V0.6.31.
Data generation: Mon Oct 30 21:59:18 2017. Site generation: Sun Feb 11 23:16:06 2024