Automatic thesaurus generation for Chinese documents

Internal identifier: 000C75 (Main/Exploration); previous: 000C74; next: 000C76

Authors: Yuen-Hsien Tseng [Taiwan]

Source:

Journal of the American Society for Information Science and Technology [1532-2882]; 2002-11.

RBID: ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C

English descriptors

Teeft:
- Algorithm, Algorithm advances, American society, Application examples, Article reports, Association weights, Automatic thesaurus construction, Automatic thesaurus generation, Average precision, Background texts, Chen, Chinese documents, Chinese text, Chinese texts, Computational linguistics, Concept hierarchies, Cooccur, Cooccurrence, Cooccurrence analysis, Cooccurrence pairs, Cooccurrence thesauri, Cooccurrence thesaurus, Croft, Database, Document, Document frequencies, Document frequency, Document retrieval, English documents, Error rate, Extraction algorithm, Heuristic rule, High school, Human subjects, Index terms, Information retrieval, Information retrieval systems, Information science, Input text string, International conference, Keyword, Keyword database, Keyword extraction, Keyword extraction algorithm, Keywords, Knowledge discovery, Legal term, Long documents, Mergelist, Next section, Noun phrases, Previous studies, Previous work, Proper nouns, Query, Query expansion, Query term, Query terms, Relatedness, Retrieval, Salton, Sanderson croft, Search aide, Search term, Segmentation, Segmented, Segmented word, Sigir, Sigir conference, Sparck jones, Subsumption test, Target document, Term association analysis, Term association methods, Term associations, Term pairs, Term relatedness, Term selection, Term suggestion, Terms cooccur, Thesaurus, Thesaurus construction, Thesaurus generation process, Time stamps, Topical terms, Tseng, Unknown word, Word list.

Abstract

This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% of them are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.
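
The accumulation step described in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not give the weight formula or the threshold value, so the Dice-style weight over sentence-level co-occurrence, the default threshold of 0.5, and the function names `document_pair_weights` and `accumulate_similarities` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def document_pair_weights(sentences):
    """Return a weight for each keyword pair within one document.

    `sentences` is a list of keyword sets, one set per sentence.
    ASSUMPTION: a Dice-style weight, 2 * n_ab / (n_a + n_b), where n_a is
    the number of sentences containing keyword a and n_ab the number of
    sentences containing both. The abstract does not state the formula.
    """
    term_count = Counter()
    pair_count = Counter()
    for keywords in sentences:
        term_count.update(keywords)
        # Sort so that each pair has one canonical (a, b) ordering.
        pair_count.update(combinations(sorted(keywords), 2))
    return {
        (a, b): 2.0 * n_ab / (term_count[a] + term_count[b])
        for (a, b), n_ab in pair_count.items()
    }


def accumulate_similarities(documents, threshold=0.5):
    """Sum per-document weights larger than `threshold` over all documents.

    ASSUMPTION: the threshold value 0.5 is hypothetical.
    """
    similarity = defaultdict(float)
    for sentences in documents:
        for pair, weight in document_pair_weights(sentences).items():
            if weight > threshold:
                similarity[pair] += weight
    return dict(similarity)


if __name__ == "__main__":
    docs = [
        [{"a", "b"}, {"a", "b"}, {"a", "c"}],
        [{"a", "b"}, {"c"}],
    ]
    print(accumulate_similarities(docs))
```

Because each document is processed independently and only pairs above the threshold are kept, the work per document depends on that document's own keywords rather than on the whole vocabulary.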

Url:

https://api.istex.fr/document/12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C/fulltext/pdf

DOI: 10.1002/asi.10146

Affiliations:

Taiwan

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 000188
to stream Istex, to step Curation: 000180
to stream Istex, to step Checkpoint: 000A67
to stream Main, to step Merge: 000C76
to stream Main, to step Curation: 000C75

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Automatic thesaurus generation for Chinese documents</title>
<author><name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C</idno>
<date when="2002" year="2002">2002</date>
<idno type="doi">10.1002/asi.10146</idno>
<idno type="url">https://api.istex.fr/document/12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000188</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000188</idno>
<idno type="wicri:Area/Istex/Curation">000180</idno>
<idno type="wicri:Area/Istex/Checkpoint">000A67</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000A67</idno>
<idno type="wicri:doubleKey">1532-2882:2002:Tseng Y:automatic:thesaurus:generation</idno>
<idno type="wicri:Area/Main/Merge">000C76</idno>
<idno type="wicri:Area/Main/Curation">000C75</idno>
<idno type="wicri:Area/Main/Exploration">000C75</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Automatic thesaurus generation for Chinese documents</title>
<author><name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
<affiliation></affiliation>
<affiliation wicri:level="1"><country wicri:rule="url">Taïwan</country>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Journal of the American Society for Information Science and Technology</title>
<title level="j" type="abbrev">J. Am. Soc. Inf. Sci.</title>
<idno type="ISSN">1532-2882</idno>
<idno type="eISSN">1532-2890</idno>
<imprint><publisher>Wiley Subscription Services, Inc., A Wiley Company</publisher>
<pubPlace>New York</pubPlace>
<date type="published" when="2002-11">2002-11</date>
<biblScope unit="volume">53</biblScope>
<biblScope unit="issue">13</biblScope>
<biblScope unit="page" from="1130">1130</biblScope>
<biblScope unit="page" to="1138">1138</biblScope>
</imprint>
<idno type="ISSN">1532-2882</idno>
</series>
<idno type="istex">12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C</idno>
<idno type="DOI">10.1002/asi.10146</idno>
<idno type="ArticleID">ASI10146</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">1532-2882</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="Teeft" xml:lang="en"><term>Algorithm</term>
<term>Algorithm advances</term>
<term>American society</term>
<term>Application examples</term>
<term>Article reports</term>
<term>Association weights</term>
<term>Automatic thesaurus construction</term>
<term>Automatic thesaurus generation</term>
<term>Average precision</term>
<term>Background texts</term>
<term>Chen</term>
<term>Chinese documents</term>
<term>Chinese text</term>
<term>Chinese texts</term>
<term>Computational linguistics</term>
<term>Concept hierarchies</term>
<term>Cooccur</term>
<term>Cooccurrence</term>
<term>Cooccurrence analysis</term>
<term>Cooccurrence pairs</term>
<term>Cooccurrence thesauri</term>
<term>Cooccurrence thesaurus</term>
<term>Croft</term>
<term>Database</term>
<term>Document</term>
<term>Document frequencies</term>
<term>Document frequency</term>
<term>Document retrieval</term>
<term>English documents</term>
<term>Error rate</term>
<term>Extraction algorithm</term>
<term>Heuristic rule</term>
<term>High school</term>
<term>Human subjects</term>
<term>Index terms</term>
<term>Information retrieval</term>
<term>Information retrieval systems</term>
<term>Information science</term>
<term>Input text string</term>
<term>International conference</term>
<term>Keyword</term>
<term>Keyword database</term>
<term>Keyword extraction</term>
<term>Keyword extraction algorithm</term>
<term>Keywords</term>
<term>Knowledge discovery</term>
<term>Legal term</term>
<term>Long documents</term>
<term>Mergelist</term>
<term>Next section</term>
<term>Noun phrases</term>
<term>Previous studies</term>
<term>Previous work</term>
<term>Proper nouns</term>
<term>Query</term>
<term>Query expansion</term>
<term>Query term</term>
<term>Query terms</term>
<term>Relatedness</term>
<term>Retrieval</term>
<term>Salton</term>
<term>Sanderson croft</term>
<term>Search aide</term>
<term>Search term</term>
<term>Segmentation</term>
<term>Segmented</term>
<term>Segmented word</term>
<term>Sigir</term>
<term>Sigir conference</term>
<term>Sparck jones</term>
<term>Subsumption test</term>
<term>Target document</term>
<term>Term association analysis</term>
<term>Term association methods</term>
<term>Term associations</term>
<term>Term pairs</term>
<term>Term relatedness</term>
<term>Term selection</term>
<term>Term suggestion</term>
<term>Terms cooccur</term>
<term>Thesaurus</term>
<term>Thesaurus construction</term>
<term>Thesaurus generation process</term>
<term>Time stamps</term>
<term>Topical terms</term>
<term>Tseng</term>
<term>Unknown word</term>
<term>Word list</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This article reports an approach to automatic thesaurus construction for Chinese documents. An effective Chinese keyword extraction algorithm is first presented. Experiments showed that for each document an average of 33% keywords unknown to a lexicon of 123,226 terms could be identified by this algorithm. Of these unregistered words, only 8.3% of them are illegal. Keywords extracted from each document are further filtered for term association analysis. Association weights larger than a threshold are then accumulated over all the documents to yield the final term pair similarities. Compared to previous studies, this method speeds up the thesaurus generation process drastically. It also achieves a similar percentage level of term relatedness.</div>
</front>
</TEI>
<affiliations><list><country><li>Taïwan</li>
</country>
</list>
<tree><country name="Taïwan"><noRegion><name sortKey="Tseng, Yuen Sien" sort="Tseng, Yuen Sien" uniqKey="Tseng Y" first="Yuen-Hsien" last="Tseng">Yuen-Hsien Tseng</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Sarre/explor/MusicSarreV3/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000C75 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000C75 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Sarre
   |area=    MusicSarreV3
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:12FFA5DB90779E1F099F5EA0AFA95D1B624DF51C
   |texte=   Automatic thesaurus generation for Chinese documents
}}

This area was generated with Dilib version V0.6.33.
Data generation: Sun Jul 15 18:16:09 2018. Site generation: Tue Mar 5 19:21:25 2024

	Exploration server for music in Saarland
	Warning: this site is under development! It was generated automatically from raw corpora, so the information has not been validated.
