Automatic thesaurus generation for Chinese documents
Internal identifier: 000C75 (Main/Exploration); previous: 000C74; next: 000C76
Authors: Yuen-Hsien Tseng [Taiwan]
Source:
- Journal of the American Society for Information Science and Technology [1532-2882]; 2002-11.
English descriptors
- Teeft:
- Algorithm, Algorithm advances, American society, Application examples, Article reports, Association weights, Automatic thesaurus construction, Automatic thesaurus generation, Average precision, Background texts, Chen, Chinese documents, Chinese text, Chinese texts, Computational linguistics, Concept hierarchies, Cooccur, Cooccurrence, Cooccurrence analysis, Cooccurrence pairs, Cooccurrence thesauri, Cooccurrence thesaurus, Croft, Database, Document, Document frequencies, Document frequency, Document retrieval, English documents, Error rate, Extraction algorithm, Heuristic rule, High school, Human subjects, Index terms, Information retrieval, Information retrieval systems, Information science, Input text string, International conference, Keyword, Keyword database, Keyword extraction, Keyword extraction algorithm, Keywords, Knowledge discovery, Legal term, Long documents, Mergelist, Next section, Noun phrases, Previous studies, Previous work, Proper nouns, Query, Query expansion, Query term, Query terms, Relatedness, Retrieval, Salton, Sanderson croft, Search aide, Search term, Segmentation, Segmented, Segmented word, Sigir, Sigir conference, Sparck jones, Subsumption test, Target document, Term association analysis, Term association methods, Term associations, Term pairs, Term relatedness, Term selection, Term suggestion, Terms cooccur, Thesaurus, Thesaurus construction, Thesaurus generation process, Time stamps, Topical terms, Tseng, Unknown word, Word list.
Abstract
This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% of them are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.
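The accumulation step described in the abstract (keep per-document association weights above a threshold, then sum them over all documents) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the author's implementation: the abstract does not give the weighting formula, so a Dice coefficient over sentence-level cooccurrence is assumed here, and the names `dice_weight`, `build_thesaurus`, and the default `threshold` value are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def dice_weight(term_a, term_b, sentences):
    """Dice coefficient of two terms over the sentences of one document.

    ASSUMPTION: the abstract does not name the weighting scheme; Dice is
    used here purely for illustration.
    """
    count_a = sum(1 for s in sentences if term_a in s)
    count_b = sum(1 for s in sentences if term_b in s)
    both = sum(1 for s in sentences if term_a in s and term_b in s)
    if count_a + count_b == 0:
        return 0.0
    return 2.0 * both / (count_a + count_b)


def build_thesaurus(documents, threshold=0.5):
    """Accumulate thresholded association weights over all documents.

    documents: list of documents; each document is a list of sentences;
    each sentence is a set of already-extracted keywords.
    Returns a dict mapping (term_a, term_b) pairs to summed weights.
    """
    similarity = defaultdict(float)
    for sentences in documents:
        # All distinct keywords of this document, in a stable order.
        keywords = sorted(set().union(*sentences))
        for a, b in combinations(keywords, 2):
            weight = dice_weight(a, b, sentences)
            # Only weights above the threshold contribute to the total.
            if weight > threshold:
                similarity[(a, b)] += weight
    return dict(similarity)


if __name__ == "__main__":
    docs = [
        [{"thesaurus", "keyword"}, {"thesaurus", "keyword"}, {"retrieval"}],
        [{"thesaurus", "keyword"}, {"thesaurus", "retrieval"}],
    ]
    for pair, score in sorted(build_thesaurus(docs).items()):
        print(pair, round(score, 3))
```

Because each document is processed independently and only pairs of its own filtered keywords are compared, the cost grows with the number of keywords per document rather than with the size of the whole vocabulary, which is consistent with the speed-up the abstract reports.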
Url:
DOI: 10.1002/asi.10146
Affiliations:
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000188
- to stream Istex, to step Curation: 000180
- to stream Istex, to step Checkpoint: 000A67
- to stream Main, to step Merge: 000C76
- to stream Main, to step Curation: 000C75
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Automatic thesaurus generation for Chinese documents</title>
<author><name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C</idno>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1002/asi.10146</idno>
<idno type="url">https://api.istex.fr/document/12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000188</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000188</idno>
<idno type="wicri:Area/Istex/Curation">000180</idno>
<idno type="wicri:Area/Istex/Checkpoint">000A67</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000A67</idno>
<idno type="wicri:doubleKey">1532-2882:2002:Tseng Y:automatic:thesaurus:generation</idno>
<idno type="wicri:Area/Main/Merge">000C76</idno>
<idno type="wicri:Area/Main/Curation">000C75</idno>
<idno type="wicri:Area/Main/Exploration">000C75</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Automatic thesaurus generation for Chinese documents</title>
<author><name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
<affiliation></affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Taïwan</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Journal of the American Society for Information Science and Technology</title>
<title level="j" type="abbrev">J. Am. Soc. Inf. Sci.</title>
<idno type="ISSN">1532-2882</idno>
<idno type="eISSN">1532-2890</idno>
<imprint><publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>New York</pubPlace>
<date type="published" when="2002-11">2002-11</date>
<biblScope unit="volume">53</biblScope>
<biblScope unit="issue">13</biblScope>
<biblScope unit="page" from="1130">1130</biblScope>
<biblScope unit="page" to="1138">1138</biblScope>
</imprint>
<idno type="ISSN">1532-2882</idno>
</series>
<idno type="istex">12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C</idno>
<idno type="DOI">10.1002/asi.10146</idno>
<idno type="ArticleID">ASI10146</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">1532-2882</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="Teeft" xml:lang="en"><term>Algorithm</term>
<term>Algorithm advances</term>
<term>American society</term>
<term>Application examples</term>
<term>Article reports</term>
<term>Association weights</term>
<term>Automatic thesaurus construction</term>
<term>Automatic thesaurus generation</term>
<term>Average precision</term>
<term>Background texts</term>
<term>Chen</term>
<term>Chinese documents</term>
<term>Chinese text</term>
<term>Chinese texts</term>
<term>Computational linguistics</term>
<term>Concept hierarchies</term>
<term>Cooccur</term>
<term>Cooccurrence</term>
<term>Cooccurrence analysis</term>
<term>Cooccurrence pairs</term>
<term>Cooccurrence thesauri</term>
<term>Cooccurrence thesaurus</term>
<term>Croft</term>
<term>Database</term>
<term>Document</term>
<term>Document frequencies</term>
<term>Document frequency</term>
<term>Document retrieval</term>
<term>English documents</term>
<term>Error rate</term>
<term>Extraction algorithm</term>
<term>Heuristic rule</term>
<term>High school</term>
<term>Human subjects</term>
<term>Index terms</term>
<term>Information retrieval</term>
<term>Information retrieval systems</term>
<term>Information science</term>
<term>Input text string</term>
<term>International conference</term>
<term>Keyword</term>
<term>Keyword database</term>
<term>Keyword extraction</term>
<term>Keyword extraction algorithm</term>
<term>Keywords</term>
<term>Knowledge discovery</term>
<term>Legal term</term>
<term>Long documents</term>
<term>Mergelist</term>
<term>Next section</term>
<term>Noun phrases</term>
<term>Previous studies</term>
<term>Previous work</term>
<term>Proper nouns</term>
<term>Query</term>
<term>Query expansion</term>
<term>Query term</term>
<term>Query terms</term>
<term>Relatedness</term>
<term>Retrieval</term>
<term>Salton</term>
<term>Sanderson croft</term>
<term>Search aide</term>
<term>Search term</term>
<term>Segmentation</term>
<term>Segmented</term>
<term>Segmented word</term>
<term>Sigir</term>
<term>Sigir conference</term>
<term>Sparck jones</term>
<term>Subsumption test</term>
<term>Target document</term>
<term>Term association analysis</term>
<term>Term association methods</term>
<term>Term associations</term>
<term>Term pairs</term>
<term>Term relatedness</term>
<term>Term selection</term>
<term>Term suggestion</term>
<term>Terms cooccur</term>
<term>Thesaurus</term>
<term>Thesaurus construction</term>
<term>Thesaurus generation process</term>
<term>Time stamps</term>
<term>Topical terms</term>
<term>Tseng</term>
<term>Unknown word</term>
<term>Word list</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% of them are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.</div>
</front>
</TEI>
<affiliations><list><country><li>Taïwan</li>
</country>
</list>
<tree><country name="Taïwan"><noRegion><name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Sarre/explor/MusicSarreV3/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C75 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000C75 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Wicri/Sarre |area= MusicSarreV3 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C |texte= Automatic thesaurus generation for Chinese documents }}
This area was generated with Dilib version V0.6.33.