Exploration server on OCR

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Categorizing Web Information on Subject with Statistical Language Modeling

Internal identifier: 001612 (Main/Merge); previous: 001611; next: 001613

Authors: Xindong Zhou [People's Republic of China]; Ting Wang [People's Republic of China]; Huiping Zhou [People's Republic of China]; Huowang Chen [People's Republic of China]

Source :

RBID : ISTEX:1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C

Abstract

With the rapid growth of the available information on the Internet, it is more difficult for us to find the relevant information quickly on the Web. Text classification, one of the most useful web information processing tools, has been paid more and more attention recently. Instead of using traditional classification models, we apply n-gram language models to classify Chinese Web text information on subject. We investigate several factors that have important effect on the performance of n-gram models, including various order n, different smoothing techniques, and different granularity of textual representation unit in Chinese. The experiment result indicates that bi-gram model based on word and tri-gram model based on character outperform others, achieving approximately 90% evaluated by F1 score.

Url:
DOI: 10.1007/978-3-540-30480-7_41
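
The general idea named in the abstract, classifying text by scoring it under one n-gram language model per subject, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which smoothing techniques were compared, so add-one (Laplace) smoothing is assumed here for simplicity, and the names `char_ngrams`, `NgramClassifier`, the `^` start marker and the two toy training sentences are all hypothetical. The sketch uses character-level units; a word-level variant would need a Chinese word segmenter, which is left out.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return one character n-gram per character of text, left-padded with '^'."""
    padded = "^" * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


class NgramClassifier:
    """Hypothetical sketch: one add-one-smoothed character n-gram model per category."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # category -> n-gram counts
        self.context_counts = defaultdict(Counter)  # category -> (n-1)-gram context counts
        self.vocab = set()                          # characters seen in any training text

    def train(self, category, text):
        for gram in char_ngrams(text, self.n):
            self.ngram_counts[category][gram] += 1
            self.context_counts[category][gram[:-1]] += 1
            self.vocab.add(gram[-1])

    def log_prob(self, category, text):
        # Assumed smoothing: add-one, with one extra slot for unseen characters.
        v = len(self.vocab) + 1
        total = 0.0
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[category][gram] + 1
            den = self.context_counts[category][gram[:-1]] + v
            total += math.log(num / den)
        return total

    def classify(self, text):
        # Pick the category whose model gives the text the highest log-probability.
        return max(self.ngram_counts, key=lambda c: self.log_prob(c, text))


# Toy usage with invented training sentences (not data from the paper).
clf = NgramClassifier(n=3)
clf.train("sports", "足球比赛今晚开始，球队准备充分")
clf.train("finance", "股票市场今日上涨，银行利率下调")
print(clf.classify("今晚的足球比赛"))
```

Changing `n` selects the model order, and swapping the body of `log_prob` is where a different smoothing scheme would go; both are factors the abstract says were investigated.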

Links to Exploration step

ISTEX:1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Categorizing Web Information on Subject with Statistical Language Modeling</title>
<author>
<name sortKey="Zhou, Xindong" sort="Zhou, Xindong" uniqKey="Zhou X" first="Xindong" last="Zhou">Xindong Zhou</name>
</author>
<author>
<name sortKey="Wang, Ting" sort="Wang, Ting" uniqKey="Wang T" first="Ting" last="Wang">Ting Wang</name>
</author>
<author>
<name sortKey="Zhou, Huiping" sort="Zhou, Huiping" uniqKey="Zhou H" first="Huiping" last="Zhou">Huiping Zhou</name>
</author>
<author>
<name sortKey="Chen, Huowang" sort="Chen, Huowang" uniqKey="Chen H" first="Huowang" last="Chen">Huowang Chen</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-30480-7_41</idno>
<idno type="url">https://api.istex.fr/document/1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">003972</idno>
<idno type="wicri:Area/Istex/Curation">003708</idno>
<idno type="wicri:Area/Istex/Checkpoint">000D96</idno>
<idno type="wicri:doubleKey">0302-9743:2004:Zhou X:categorizing:web:information</idno>
<idno type="wicri:Area/Main/Merge">001612</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Categorizing Web Information on Subject with Statistical Language Modeling</title>
<author>
<name sortKey="Zhou, Xindong" sort="Zhou, Xindong" uniqKey="Zhou X" first="Xindong" last="Zhou">Xindong Zhou</name>
<affiliation wicri:level="1">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
<affiliation>
<wicri:noCountry code="no comma">E-mail: zhouxindong@sohu.com</wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Wang, Ting" sort="Wang, Ting" uniqKey="Wang T" first="Ting" last="Wang">Ting Wang</name>
<affiliation wicri:level="1">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
<affiliation>
<wicri:noCountry code="no comma">E-mail: wonderwang70@hotmail.com</wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Zhou, Huiping" sort="Zhou, Huiping" uniqKey="Zhou H" first="Huiping" last="Zhou">Huiping Zhou</name>
<affiliation wicri:level="1">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Chen, Huowang" sort="Chen, Huowang" uniqKey="Chen H" first="Huowang" last="Chen">Huowang Chen</name>
<affiliation wicri:level="1">
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>National Laboratory for Parallel and Distributed Processing, 410073, Changsha, Hunan</wicri:regionArea>
<wicri:noRegion>Hunan</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2004</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C</idno>
<idno type="DOI">10.1007/978-3-540-30480-7_41</idno>
<idno type="ChapterID">41</idno>
<idno type="ChapterID">Chap41</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: With the rapid growth of the available information on the Internet, it is more difficult for us to find the relevant information quickly on the Web. Text classification, one of the most useful web information processing tools, has been paid more and more attention recently. Instead of using traditional classification models, we apply n-gram language models to classify Chinese Web text information on subject. We investigate several factors that have important effect on the performance of n-gram models, including various order n, different smoothing techniques, and different granularity of textual representation unit in Chinese. The experiment result indicates that bi-gram model based on word and tri-gram model based on character outperform others, achieving approximately 90% evaluated by F1 score.</div>
</front>
</TEI>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001612 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 001612 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     ISTEX:1C9D43A1CB78032B24AB9BEFE1A33830E723FE1C
   |texte=   Categorizing Web Information on Subject with Statistical Language Modeling
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024