Mining a corpus of biographical texts using keywords
Internal identifier: 000059 (Main/Merge); previous: 000058; next: 000060
Authors: Mike Conway [Japan]
Source:
- Literary and Linguistic Computing [ 0268-1145 ] ; 2010-04.
Abstract
Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords, and the associated concepts of keyness and key-keyness, have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representation to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.
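The frequency-based representation mentioned in the abstract can be illustrated with a minimal sketch. This is not the article's implementation: the toy sentences, the labels `bio` and `other`, the cutoff `k=10`, the multinomial model with add-one smoothing, and all function and class names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercase the sentence and keep runs of letters and digits as tokens.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def top_frequent_words(sentences, k):
    # Return the k most frequent tokens over all sentences.
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return [word for word, _ in counts.most_common(k)]


class FrequentWordNaiveBayes:
    """Multinomial naive Bayes restricted to a fixed vocabulary (illustrative)."""

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)

    def fit(self, sentences, labels):
        self.classes = sorted(set(labels))
        # Log prior of each class, estimated from label frequencies.
        self.log_priors = {
            c: math.log(labels.count(c) / len(labels)) for c in self.classes
        }
        # Per-class counts of in-vocabulary tokens.
        self.word_counts = {c: Counter() for c in self.classes}
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(
                t for t in tokenize(sentence) if t in self.vocabulary
            )
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def predict(self, sentence):
        v = len(self.vocabulary)
        scores = {}
        for c in self.classes:
            score = self.log_priors[c]
            for t in tokenize(sentence):
                if t in self.vocabulary:
                    # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                    score += math.log(
                        (self.word_counts[c][t] + 1) / (self.totals[c] + v)
                    )
            scores[c] = score
        return max(scores, key=scores.get)


# Invented toy data, not taken from the article's corpus.
train_sentences = [
    "She was born in 1850 in London",
    "He was educated at Oxford and married in 1902",
    "He died in Paris in 1931",
    "The weather is cold today",
    "Stock prices fell sharply on Monday",
    "The train leaves the station at noon",
]
train_labels = ["bio", "bio", "bio", "other", "other", "other"]

vocabulary = top_frequent_words(train_sentences, k=10)
classifier = FrequentWordNaiveBayes(vocabulary).fit(train_sentences, train_labels)
print(classifier.predict("She was born in Paris"))
```

Replacing `top_frequent_words` with a keyword or key-keyword selection step would change only the vocabulary passed to the classifier, which is what makes the three representations directly comparable in a setup of this kind.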
Url:
DOI: 10.1093/llc/fqp035
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000385
- to stream Istex, to step Curation: 000385
- to stream Istex, to step Checkpoint: 000030
Links to Exploration step
ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title>Mining a corpus of biographical texts using keywords</title>
<author wicri:is="90%"><name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB</idno>
<date when="2010" year="2010">2010</date>
<idno type="doi">10.1093/llc/fqp035</idno>
<idno type="url">https://api.istex.fr/document/7C87E90B31174A63CC37AC19EBEA8261A9D411BB/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000385</idno>
<idno type="wicri:Area/Istex/Curation">000385</idno>
<idno type="wicri:Area/Istex/Checkpoint">000030</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000030</idno>
<idno type="wicri:doubleKey">0268-1145:2010:Conway M:mining:a:corpus</idno>
<idno type="wicri:Area/Main/Merge">000059</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a">Mining a corpus of biographical texts using keywords</title>
<author wicri:is="90%"><name sortKey="Conway, Mike" sort="Conway, Mike" uniqKey="Conway M" first="Mike" last="Conway">Mike Conway</name>
<affiliation wicri:level="1"><country xml:lang="fr">Japon</country>
<wicri:regionArea>National Institute of Informatics</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Japon</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Literary and Linguistic Computing</title>
<idno type="ISSN">0268-1145</idno>
<idno type="eISSN">1477-4615</idno>
<imprint><publisher>Oxford University Press</publisher>
<date type="published" when="2010-04">2010-04</date>
<biblScope unit="volume">25</biblScope>
<biblScope unit="issue">1</biblScope>
<biblScope unit="page" from="23">23</biblScope>
<biblScope unit="page" to="35">35</biblScope>
</imprint>
<idno type="ISSN">0268-1145</idno>
</series>
<idno type="istex">7C87E90B31174A63CC37AC19EBEA8261A9D411BB</idno>
<idno type="DOI">10.1093/llc/fqp035</idno>
<idno type="ArticleID">fqp035</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0268-1145</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract">Using statistically derived keywords to characterize texts has become an important research method for digital humanists and corpus linguists in areas such as literary analysis and the exploration of genre difference. Keywords, and the associated concepts of keyness and key-keyness, have inspired conferences and workshops, many and varied research papers, and are central to several modern corpus processing tools. In this article, we present evidence that (at least for the task of biographical sentence classification) frequent words characterize texts better than keywords or key-keywords. Using the naïve Bayes learning algorithm in conjunction with frequency-, keyword-, and key-keyword-based text representation to classify a corpus of biographical sentences, we discovered that the use of frequent words alone provided a classification accuracy better than either the keyword or key-keyword representations at a statistically significant level. This result suggests that (for the biographical sentence classification task at least) frequent words characterize texts better than keywords derived using more computationally intensive methods.</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Ticri/explor/TeiVM2/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000059 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 000059 | SxmlIndent | more
To add a link to this page in the Wicri network
{{Explor lien |wiki= Wicri/Ticri |area= TeiVM2 |flux= Main |étape= Merge |type= RBID |clé= ISTEX:7C87E90B31174A63CC37AC19EBEA8261A9D411BB |texte= Mining a corpus of biographical texts using keywords }}
This area was generated with Dilib version V0.6.31.