OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A Chinese OCR spelling check approach based on statistical language models

Internal identifier: 001732 (Main/Merge); previous: 001731; next: 001733

Authors: LI ZHUANG [People's Republic of China]; TA BAO [People's Republic of China]; XIAOYAN ZHU [People's Republic of China]; CHUNHENG WANG [People's Republic of China]; Satoshi Naoi [Japan]

Source:

RBID : Pascal:06-0112413

French descriptors

English descriptors

Abstract

This paper describes an effective spelling-check approach for Chinese OCR based on a new multi-knowledge statistical language model. The model combines the conventional n-gram language model with a new LSA (Latent Semantic Analysis) language model, so that both local (syntactic) and global (semantic) information are used. Furthermore, similar Chinese characters are used in the Viterbi search to expand the candidate list, so that more potentially correct results are considered. With this approach, the best recognition accuracy increases from 79.3% to 91.9%, a 60.9% relative error reduction (the error rate falls from 20.7% to 8.1%).
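
The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea, under stated assumptions: the n-gram and semantic scores are combined by linear interpolation with a made-up weight `LAMBDA`, "similar characters" is taken to mean visually similar ones listed in a hypothetical table `SIMILAR`, and the `SEMANTIC` dictionary is a toy stand-in for an LSA score rather than a real latent semantic analysis. All probabilities, names, and the expansion penalty are illustrative and do not come from the paper.

```python
import math

# Hypothetical table of visually similar characters (illustrative only).
SIMILAR = {"末": ["未"], "未": ["末"], "己": ["已"], "已": ["己"]}

# Toy bigram probabilities P(cur | prev); "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "未"): 0.5,
    ("<s>", "末"): 0.5,
    ("未", "来"): 0.9,
    ("末", "来"): 0.01,
}

# Toy stand-in for an LSA score P(char | global context). A real LSA model
# would derive this from similarities in a reduced semantic space.
SEMANTIC = {"未": 0.6, "末": 0.1, "来": 0.5}

FLOOR = 1e-6   # probability assigned to unseen events
LAMBDA = 0.7   # assumed interpolation weight between local and global scores


def lm_logprob(prev, cur):
    """Log of the interpolated local (bigram) and global (semantic) score."""
    local = BIGRAM.get((prev, cur), FLOOR)
    global_score = SEMANTIC.get(cur, FLOOR)
    return math.log(LAMBDA * local + (1 - LAMBDA) * global_score)


def expand(candidates, penalty=0.5):
    """Add similar characters to one position's {char: ocr_score} list.

    Added characters inherit the score of the character they resemble,
    scaled down by `penalty` (an assumed heuristic).
    """
    expanded = dict(candidates)
    for ch, score in candidates.items():
        for sim in SIMILAR.get(ch, []):
            expanded.setdefault(sim, score * penalty)
    return expanded


def viterbi(lattice):
    """Return the best character string for a list of {char: ocr_score}."""
    best = {"<s>": (0.0, "")}  # state -> (log score, path so far)
    for candidates in lattice:
        nxt = {}
        for cur, ocr in expand(candidates).items():
            # Best predecessor for this character under the language model.
            score, path = max(
                (s + lm_logprob(prev, cur), p) for prev, (s, p) in best.items()
            )
            nxt[cur] = (score + math.log(ocr), path + cur)
        best = nxt
    return max(best.values())[1]
```

With these toy numbers, `viterbi([{"末": 0.9}, {"来": 0.95}])` returns `"未来"`: the recognizer proposed only 末, but the expanded candidate 未 wins because the bigram 未来 is far more probable than 末来. Without the expansion step the correct character would never enter the search.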

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A Chinese OCR spelling check approach based on statistical language models</title>
<author>
<name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<inist:fA14 i1="02">
<s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<inist:fA14 i1="02">
<s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<inist:fA14 i1="02">
<s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Chunheng Wang" sort="Chunheng Wang" uniqKey="Chunheng Wang" last="Chunheng Wang">CHUNHENG WANG</name>
<affiliation wicri:level="3">
<inist:fA14 i1="03">
<s1>Fujitsu R&amp;D Center Co. Ltd</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Naoi, Satoshi" sort="Naoi, Satoshi" uniqKey="Naoi S" first="Satoshi" last="Naoi">Satoshi Naoi</name>
<affiliation wicri:level="1">
<inist:fA14 i1="04">
<s1>Fujitsu Laboratories Ltd</s1>
<s2>Kawasaki</s2>
<s3>JPN</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Fujitsu Laboratories Ltd</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">06-0112413</idno>
<date when="2004">2004</date>
<idno type="stanalyst">PASCAL 06-0112413 INIST</idno>
<idno type="RBID">Pascal:06-0112413</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000405</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000382</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000521</idno>
<idno type="wicri:Area/Main/Merge">001732</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">A Chinese OCR spelling check approach based on statistical language models</title>
<author>
<name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<inist:fA14 i1="02">
<s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<inist:fA14 i1="02">
<s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3">
<inist:fA14 i1="02">
<s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Chunheng Wang" sort="Chunheng Wang" uniqKey="Chunheng Wang" last="Chunheng Wang">CHUNHENG WANG</name>
<affiliation wicri:level="3">
<inist:fA14 i1="03">
<s1>Fujitsu R&amp;D Center Co. Ltd</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName>
<settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Naoi, Satoshi" sort="Naoi, Satoshi" uniqKey="Naoi S" first="Satoshi" last="Naoi">Satoshi Naoi</name>
<affiliation wicri:level="1">
<inist:fA14 i1="04">
<s1>Fujitsu Laboratories Ltd</s1>
<s2>Kawasaki</s2>
<s3>JPN</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Fujitsu Laboratories Ltd</wicri:noRegion>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Chinese</term>
<term>Knowledge representation</term>
<term>Language analysis</term>
<term>Modeling</term>
<term>Optical character recognition</term>
<term>Semantic analysis</term>
<term>Semantics</term>
<term>Statistical model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Chinois</term>
<term>Reconnaissance optique caractère</term>
<term>Analyse langage</term>
<term>Reconnaissance caractère</term>
<term>Sémantique</term>
<term>Modèle statistique</term>
<term>Modélisation</term>
<term>Analyse sémantique</term>
<term>Représentation connaissance</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper describes an effective spelling-check approach for Chinese OCR based on a new multi-knowledge statistical language model. The model combines the conventional n-gram language model with a new LSA (Latent Semantic Analysis) language model, so that both local (syntactic) and global (semantic) information are used. Furthermore, similar Chinese characters are used in the Viterbi search to expand the candidate list, so that more potentially correct results are considered. With this approach, the best recognition accuracy increases from 79.3% to 91.9%, a 60.9% relative error reduction (the error rate falls from 20.7% to 8.1%).</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Japon</li>
<li>République populaire de Chine</li>
</country>
<settlement>
<li>Pékin</li>
</settlement>
</list>
<tree>
<country name="République populaire de Chine">
<noRegion>
<name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
</noRegion>
<name sortKey="Chunheng Wang" sort="Chunheng Wang" uniqKey="Chunheng Wang" last="Chunheng Wang">CHUNHENG WANG</name>
<name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
<name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
<name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
</country>
<country name="Japon">
<noRegion>
<name sortKey="Naoi, Satoshi" sort="Naoi, Satoshi" uniqKey="Naoi S" first="Satoshi" last="Naoi">Satoshi Naoi</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001732 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 001732 | SxmlIndent | more
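
Outside Dilib, a record like the one shown on this page could also be read with a general-purpose XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed-down fragment of the record. Two points are assumptions rather than part of the Wicri tooling: the function names `parse_record` and `summarize` are hypothetical, and because the record uses the `wicri:` and `inist:` prefixes without declaring them, the code binds them to placeholder URIs (`urn:example:wicri`, `urn:example:inist`) so that the parser accepts the text.

```python
import xml.etree.ElementTree as ET

# Trimmed-down fragment of the record; in practice the text would come
# from a file or from the output of the HfdSelect command.
RECORD = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A Chinese OCR spelling check approach based on statistical language models</title>
<author>
<name sortKey="Li Zhuang">LI ZHUANG</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>DCST, Tsinghua University</s1>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Naoi, Satoshi">Satoshi Naoi</name>
</author>
</titleStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Chinese</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""


def parse_record(text):
    """Parse a record whose wicri:/inist: prefixes are undeclared.

    The prefixes are bound to placeholder URIs on a wrapper element;
    the real namespace URIs are not given in the record.
    """
    wrapped = (
        '<wrapper xmlns:wicri="urn:example:wicri" '
        'xmlns:inist="urn:example:inist">' + text + "</wrapper>"
    )
    return ET.fromstring(wrapped).find("record")


def summarize(record):
    """Collect the title, author names, and English keywords."""
    stmt = record.find(".//titleStmt")
    return {
        "title": stmt.findtext("title"),
        "authors": [a.findtext("name") for a in stmt.findall("author")],
        "keywords": [
            t.text for t in record.findall(".//keywords[@scheme='KwdEn']/term")
        ],
    }


info = summarize(parse_record(RECORD))
print(info["title"])
print(info["authors"])
print(info["keywords"])
```

Reading only `titleStmt` avoids counting each author twice, since the full record repeats the author list under `sourceDesc`.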

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     Pascal:06-0112413
   |texte=   A Chinese OCR spelling check approach based on statistical language models
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024