Exploration server on music in Saarland

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Automatic thesaurus generation for Chinese documents

Internal identifier: 000C75 (Main/Exploration); previous: 000C74; next: 000C76

Authors: Yuen-Hsien Tseng [Taiwan]

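Source: Journal of the American Society for Information Science and Technology [1532-2882]; 2002-11; volume 53, issue 13, pages 1130-1138.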

RBID : ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C

English descriptors

Abstract

This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% of them are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.
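
The accumulation step described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the article's implementation: the abstract does not give the weighting formula, so the Dice-style per-document weight, the function names `document_weights` and `accumulate_similarities`, and the threshold values are all hypothetical. Each document is assumed to be already reduced to a list of sentences, each sentence being a list of extracted keywords.

```python
from collections import defaultdict
from itertools import combinations


def document_weights(sentences):
    """Per-document association weight for each keyword pair.

    Assumption: a Dice-style coefficient over sentence co-occurrence,
    2 * n(a, b) / (n(a) + n(b)). The abstract does not state the
    formula actually used in the article.
    """
    term_count = defaultdict(int)
    pair_count = defaultdict(int)
    for sentence in sentences:
        terms = sorted(set(sentence))  # count each keyword once per sentence
        for term in terms:
            term_count[term] += 1
        for a, b in combinations(terms, 2):
            pair_count[(a, b)] += 1
    return {
        (a, b): 2 * n / (term_count[a] + term_count[b])
        for (a, b), n in pair_count.items()
    }


def accumulate_similarities(documents, threshold=0.5):
    """Sum the per-document weights that exceed `threshold` over all documents."""
    similarity = defaultdict(float)
    for sentences in documents:
        for pair, weight in document_weights(sentences).items():
            if weight > threshold:
                similarity[pair] += weight
    return dict(similarity)


# Two toy documents, each a list of keyword lists (one per sentence).
docs = [
    [["thesaurus", "keyword"], ["thesaurus", "keyword", "query"]],
    [["thesaurus", "keyword"], ["query"]],
]
print(accumulate_similarities(docs, threshold=0.7))
# {('keyword', 'thesaurus'): 2.0}
```

Because only weights above the threshold are accumulated, weak pairs are discarded per document before any corpus-wide aggregation, which is consistent with the speed-up the abstract reports.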

Url:
DOI: 10.1002/asi.10146


Affiliations:


Links toward previous steps (curation, corpus...)


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Automatic thesaurus generation for Chinese documents</title>
<author>
<name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C</idno>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1002/asi.10146</idno>
<idno type="url">https://api.istex.fr/document/12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000188</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000188</idno>
<idno type="wicri:Area/Istex/Curation">000180</idno>
<idno type="wicri:Area/Istex/Checkpoint">000A67</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000A67</idno>
<idno type="wicri:doubleKey">1532-2882:2002:Tseng Y:automatic:thesaurus:generation</idno>
<idno type="wicri:Area/Main/Merge">000C76</idno>
<idno type="wicri:Area/Main/Curation">000C75</idno>
<idno type="wicri:Area/Main/Exploration">000C75</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Automatic thesaurus generation for Chinese documents</title>
<author>
<name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
<affiliation></affiliation>
<affiliation wicri:level="1">
<country wicri:rule="url">Taïwan</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Journal of the American Society for Information Science and Technology</title>
<title level="j" type="abbrev">J. Am. Soc. Inf. Sci.</title>
<idno type="ISSN">1532-2882</idno>
<idno type="eISSN">1532-2890</idno>
<imprint>
<publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>New York</pubPlace>
<date type="published" when="2002-11">2002-11</date>
<biblScope unit="volume">53</biblScope>
<biblScope unit="issue">13</biblScope>
<biblScope unit="page" from="1130">1130</biblScope>
<biblScope unit="page" to="1138">1138</biblScope>
</imprint>
<idno type="ISSN">1532-2882</idno>
</series>
<idno type="istex">12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C</idno>
<idno type="DOI">10.1002/asi.10146</idno>
<idno type="ArticleID">ASI10146</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">1532-2882</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="Teeft" xml:lang="en">
<term>Algorithm</term>
<term>Algorithm advances</term>
<term>American society</term>
<term>Application examples</term>
<term>Article reports</term>
<term>Association weights</term>
<term>Automatic thesaurus construction</term>
<term>Automatic thesaurus generation</term>
<term>Average precision</term>
<term>Background texts</term>
<term>Chen</term>
<term>Chinese documents</term>
<term>Chinese text</term>
<term>Chinese texts</term>
<term>Computational linguistics</term>
<term>Concept hierarchies</term>
<term>Cooccur</term>
<term>Cooccurrence</term>
<term>Cooccurrence analysis</term>
<term>Cooccurrence pairs</term>
<term>Cooccurrence thesauri</term>
<term>Cooccurrence thesaurus</term>
<term>Croft</term>
<term>Database</term>
<term>Document</term>
<term>Document frequencies</term>
<term>Document frequency</term>
<term>Document retrieval</term>
<term>English documents</term>
<term>Error rate</term>
<term>Extraction algorithm</term>
<term>Heuristic rule</term>
<term>High school</term>
<term>Human subjects</term>
<term>Index terms</term>
<term>Information retrieval</term>
<term>Information retrieval systems</term>
<term>Information science</term>
<term>Input text string</term>
<term>International conference</term>
<term>Keyword</term>
<term>Keyword database</term>
<term>Keyword extraction</term>
<term>Keyword extraction algorithm</term>
<term>Keywords</term>
<term>Knowledge discovery</term>
<term>Legal term</term>
<term>Long documents</term>
<term>Mergelist</term>
<term>Next section</term>
<term>Noun phrases</term>
<term>Previous studies</term>
<term>Previous work</term>
<term>Proper nouns</term>
<term>Query</term>
<term>Query expansion</term>
<term>Query term</term>
<term>Query terms</term>
<term>Relatedness</term>
<term>Retrieval</term>
<term>Salton</term>
<term>Sanderson croft</term>
<term>Search aide</term>
<term>Search term</term>
<term>Segmentation</term>
<term>Segmented</term>
<term>Segmented word</term>
<term>Sigir</term>
<term>Sigir conference</term>
<term>Sparck jones</term>
<term>Subsumption test</term>
<term>Target document</term>
<term>Term association analysis</term>
<term>Term association methods</term>
<term>Term associations</term>
<term>Term pairs</term>
<term>Term relatedness</term>
<term>Term selection</term>
<term>Term suggestion</term>
<term>Terms cooccur</term>
<term>Thesaurus</term>
<term>Thesaurus construction</term>
<term>Thesaurus generation process</term>
<term>Time stamps</term>
<term>Topical terms</term>
<term>Tseng</term>
<term>Unknown word</term>
<term>Word list</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% of them are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Taïwan</li>
</country>
</list>
<tree>
<country name="Taïwan">
<noRegion>
<name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
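
As a hedged illustration of how a record like the one above might be read programmatically, the following Python sketch extracts the title, DOI, and Teeft keywords with the standard library. The record as shown uses the `wicri:` prefix without declaring it, which a namespace-aware parser rejects, so the sketch binds the prefix to a placeholder URI first. The URI `urn:example:wicri` and the helper names `parse_record` and `summarize` are assumptions made for this example, not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI; the record itself declares none.
WICRI_NS = "urn:example:wicri"


def parse_record(xml_text):
    """Parse a <record> whose `wicri:` prefix is undeclared."""
    # Bind the prefix on the root element so the parser accepts the
    # wicri:* attributes that appear further down.
    patched = xml_text.replace(
        "<record>", f'<record xmlns:wicri="{WICRI_NS}">', 1
    )
    return ET.fromstring(patched)


def summarize(record):
    """Return the title, DOI, and Teeft keywords of a parsed record."""
    return {
        "title": record.findtext(".//titleStmt/title"),
        "doi": record.findtext(".//publicationStmt/idno[@type='doi']"),
        "keywords": [
            term.text
            for term in record.findall(".//keywords[@scheme='Teeft']/term")
        ],
    }
```

Calling `summarize(parse_record(text))` on the record text returns a plain dictionary; missing elements come back as `None` or an empty list rather than raising.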

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Sarre/explor/MusicSarreV3/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C75 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000C75 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Sarre
   |area=    MusicSarreV3
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C
   |texte=   Automatic thesaurus generation for Chinese documents
}}
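
When many records need such a link, the template text can be produced from a few fields. The Python sketch below is only an illustration: the function name `explor_lien` and its argument names are hypothetical, and the parameter keys and column alignment are copied from the template shown above rather than from any Wicri specification.

```python
def explor_lien(wiki, area, flux, etape, type_, cle, texte):
    """Build an `Explor lien` template call as wiki text."""
    fields = [
        ("wiki", wiki),
        ("area", area),
        ("flux", flux),
        ("étape", etape),
        ("type", type_),
        ("clé", cle),
        ("texte", texte),
    ]
    lines = ["{{Explor lien"]
    for key, value in fields:
        label = f"|{key}="
        # Pad the label to 10 characters so the values line up in one column.
        lines.append(f"   {label:<10}{value}")
    lines.append("}}")
    return "\n".join(lines)


print(explor_lien(
    "Wicri/Sarre", "MusicSarreV3", "Main", "Exploration", "RBID",
    "ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C",
    "Automatic thesaurus generation for Chinese documents",
))
```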

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Sun Jul 15 18:16:09 2018. Site generation: Tue Mar 5 19:21:25 2024