Categorizing Web Information on Subject with Statistical Language Modeling
Internal identifier: 001612 (Main/Merge); previous: 001611; next: 001613
Authors: Xindong Zhou [People's Republic of China]; Ting Wang [People's Republic of China]; Huiping Zhou [People's Republic of China]; Huowang Chen [People's Republic of China]
Source:
- Lecture Notes in Computer Science [ 0302-9743 ] ; 2004.
Abstract
With the rapid growth of the available information on the Internet, it is more difficult for us to find the relevant information quickly on the Web. Text classification, one of the most useful web information processing tools, has been paid more and more attention recently. Instead of using traditional classification models, we apply n-gram language models to classify Chinese Web text information on subject. We investigate several factors that have important effect on the performance of n-gram models, including various order n, different smoothing techniques, and different granularity of textual representation unit in Chinese. The experiment result indicates that bi-gram model based on word and tri-gram model based on character outperform others, achieving approximately 90% evaluated by F1 score.
DOI: 10.1007/978-3-540-30480-7_41
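The abstract describes classifying text by subject with n-gram language models: one model is trained per category, and a document is assigned to the category whose model gives it the highest probability. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes character-level n-grams, add-one (Laplace) smoothing, and a tiny invented training set; the names `char_ngrams` and `NgramClassifier` are hypothetical, and the record does not say which smoothing techniques or data the paper used.

```python
import math
from collections import Counter, defaultdict

START = "\x02"  # assumed padding symbol marking the start of a text


def char_ngrams(text, n):
    """Return the character n-grams of text, padded with n-1 start symbols."""
    padded = START * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramClassifier:
    """One character n-gram model per label, with add-one smoothing (an assumption)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # label -> n-gram counts
        self.context_counts = defaultdict(Counter)  # label -> (n-1)-gram context counts
        self.doc_counts = Counter()                 # label -> number of training texts
        self.vocab = set()                          # characters seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_score(self, text, label):
        """log P(label) + sum of log P(char | context, label)."""
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[label][gram] + 1
            denominator = self.context_counts[label][gram[:-1]] + v
            score += math.log(numerator / denominator)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Invented toy data, for illustration only (not from the paper).
train_texts = ["足球比赛今晚开始", "篮球比赛结果公布", "股票市场今日上涨", "银行利率再次下调"]
train_labels = ["sports", "sports", "finance", "finance"]

clf = NgramClassifier(n=2).fit(train_texts, train_labels)
print(clf.predict("足球比赛"))
print(clf.predict("股票市场"))
```

Changing `n` varies the model order, and replacing `char_ngrams` with a function over segmented words would switch the representation unit from characters to words, the two factors the abstract says were compared.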
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 003972
- to stream Istex, to step Curation: 003708
- to stream Istex, to step Checkpoint: 000D96
Links to Exploration step
ISTEX:1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Categorizing Web Information on Subject with Statistical Language Modeling</title>
<author><name sortKey="Zhou, Xindong" sort="Zhou, Xindong" uniqKey="Zhou X" first="Xindong" last="Zhou">Xindong Zhou</name>
</author>
<author><name sortKey="Wang, Ting" sort="Wang, Ting" uniqKey="Wang T" first="Ting" last="Wang">Ting Wang</name>
</author>
<author><name sortKey="Zhou, Huiping" sort="Zhou, Huiping" uniqKey="Zhou H" first="Huiping" last="Zhou">Huiping Zhou</name>
</author>
<author><name sortKey="Chen, Huowang" sort="Chen, Huowang" uniqKey="Chen H" first="Huowang" last="Chen">Huowang Chen</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-30480-7_41</idno>
<idno type="url">https://api.istex.fr/document/1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">003972</idno>
<idno type="wicri:Area/Istex/Curation">003708</idno>
<idno type="wicri:Area/Istex/Checkpoint">000D96</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Zhou X:categorizing:web:information</idno>
<idno type="wicri:Area/Main/Merge">001612</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Categorizing Web Information on Subject with Statistical Language Modeling</title>
<author><name sortKey="Zhou, Xindong" sort="Zhou, Xindong" uniqKey="Zhou X" first="Xindong" last="Zhou">Xindong Zhou</name>
<affiliation wicri:level="1"><country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
<affiliation><wicri:noCountry code="no comma">E-mail: zhouxindong@sohu.com</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Wang, Ting" sort="Wang, Ting" uniqKey="Wang T" first="Ting" last="Wang">Ting Wang</name>
<affiliation wicri:level="1"><country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
<affiliation><wicri:noCountry code="no comma">E-mail: wonderwang70@hotmail.com</wicri:noCountry>
</affiliation>
</author>
<author><name sortKey="Zhou, Huiping" sort="Zhou, Huiping" uniqKey="Zhou H" first="Huiping" last="Zhou">Huiping Zhou</name>
<affiliation wicri:level="1"><country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Chen, Huowang" sort="Chen, Huowang" uniqKey="Chen H" first="Huowang" last="Chen">Huowang Chen</name>
<affiliation wicri:level="1"><country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2004</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C</idno>
<idno type="DOI">10.1007/978-3-540-30480-7_41</idno>
<idno type="ChapterID">41</idno>
<idno type="ChapterID">Chap41</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: With the rapid growth of the available information on the Internet, it is more difficult for us to find the relevant information quickly on the Web. Text classification, one of the most useful web information processing tools, has been paid more and more attention recently. Instead of using traditional classification models, we apply n-gram language models to classify Chinese Web text information on subject. We investigate several factors that have important effect on the performance of n-gram models, including various order n, different smoothing techniques, and different granularity of textual representation unit in Chinese. The experiment result indicates that bi-gram model based on word and tri-gram model based on character outperform others, achieving approximately 90% evaluated by F1 score.</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001612 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 001612 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= ISTEX:1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C |texte= Categorizing Web Information on Subject with Statistical Language Modeling }}
This area was generated with Dilib version V0.6.32.