A Chinese OCR spelling check approach based on statistical language models
Internal identifier: 001732 (Main/Merge)
Authors: LI ZHUANG [People's Republic of China]; TA BAO [People's Republic of China]; XIAOYAN ZHU [People's Republic of China]; CHUNHENG WANG [People's Republic of China]; Satoshi Naoi [Japan]
Abstract
This paper describes an effective spelling check approach for Chinese OCR built on a new multi-knowledge-based statistical language model. The model combines a conventional n-gram language model with a new LSA (Latent Semantic Analysis) language model, so that both local (syntactic) and global (semantic) information are used. In addition, similar Chinese characters are added to the candidate list during the Viterbi search, so that the list is more likely to contain the correct result. With this approach, the best recognition accuracy rises from 79.3% to 91.9%, a 60.9% reduction in errors (the error rate falls from 20.7% to 8.1%).
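The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It runs a Viterbi search over a lattice of OCR candidates, optionally expanded with similar characters, and scores each transition with a linear interpolation of a bigram probability and a per-character semantic score. The interpolation weight `lam`, the `penalty` applied to added characters, the `semantic` table (a placeholder for an LSA-derived score) and every probability below are assumptions chosen for illustration.

```python
import math

START = "<s>"  # sentence-start symbol (an assumption of this sketch)

def expand(candidates, similar, penalty=0.5):
    """Add similar characters to one position's candidate list.

    An added character inherits the recognizer score of the character it
    resembles, scaled by `penalty` (a hypothetical weighting).
    """
    out = dict(candidates)
    for ch, p in candidates.items():
        for sim in similar.get(ch, ()):
            out.setdefault(sim, p * penalty)
    return out

def viterbi(lattice, bigram, semantic, lam=0.7, floor=1e-6):
    """Return the best character sequence through the candidate lattice."""
    def lm(prev, ch):
        # Interpolate local (bigram) and global (semantic) evidence.
        local = bigram.get((prev, ch), floor)
        global_ = semantic.get(ch, floor)
        return lam * local + (1 - lam) * global_

    # best[ch] = (log score, path) of the best sequence ending in ch
    best = {
        ch: (math.log(p) + math.log(lm(START, ch)), [ch])
        for ch, p in lattice[0].items()
    }
    for column in lattice[1:]:
        new = {}
        for ch, p in column.items():
            score, path = max(
                (s + math.log(lm(prev, ch)), pth)
                for prev, (s, pth) in best.items()
            )
            new[ch] = (score + math.log(p), path + [ch])
        best = new
    return "".join(max(best.values())[1])

# Toy data (invented): the recognizer never proposes the correct "已".
raw = [{"己": 0.6, "巳": 0.4}, {"经": 0.9, "径": 0.1}]
similar = {"己": ["已", "巳"]}
bigram = {
    (START, "已"): 0.05, (START, "己"): 0.02,
    ("已", "经"): 0.6, ("己", "经"): 0.001,
}
semantic = {"已": 0.02, "己": 0.01, "经": 0.03}

expanded = [expand(column, similar) for column in raw]
print(viterbi(raw, bigram, semantic))       # 己经
print(viterbi(expanded, bigram, semantic))  # 已经
```

In this toy case the correct first character is absent from the recognizer's output, so the search can only recover it once the candidate list has been expanded.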
Links toward previous steps (curation, corpus...)
- to stream PascalFrancis, to step Corpus: 000405
- to stream PascalFrancis, to step Curation: 000382
- to stream PascalFrancis, to step Checkpoint: 000521
Links to Exploration step
Pascal:06-0112413
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">A Chinese OCR spelling check approach based on statistical language models</title>
<author><name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Chunheng Wang" sort="Chunheng Wang" uniqKey="Chunheng Wang" last="Chunheng Wang">CHUNHENG WANG</name>
<affiliation wicri:level="3"><inist:fA14 i1="03"><s1>Fujitsu R&amp;D Center Co. Ltd</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Naoi, Satoshi" sort="Naoi, Satoshi" uniqKey="Naoi S" first="Satoshi" last="Naoi">Satoshi Naoi</name>
<affiliation wicri:level="1"><inist:fA14 i1="04"><s1>Fujitsu Laboratories Ltd</s1>
<s2>Kawasaki</s2>
<s3>JPN</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Fujitsu Laboratories Ltd</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">06-0112413</idno>
<date when="2004">2004</date>
<idno type="stanalyst">PASCAL 06-0112413 INIST</idno>
<idno type="RBID">Pascal:06-0112413</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000405</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000382</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000521</idno>
<idno type="wicri:Area/Main/Merge">001732</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">A Chinese OCR spelling check approach based on statistical language models</title>
<author><name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
<affiliation wicri:level="3"><inist:fA14 i1="01"><s1>DCST, Tsinghua University</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
<affiliation wicri:level="3"><inist:fA14 i1="02"><s1>State Key Laboratory of Intelligent Technology and Systems</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Chunheng Wang" sort="Chunheng Wang" uniqKey="Chunheng Wang" last="Chunheng Wang">CHUNHENG WANG</name>
<affiliation wicri:level="3"><inist:fA14 i1="03"><s1>Fujitsu R&amp;D Center Co. Ltd</s1>
<s2>Beijing</s2>
<s3>CHN</s3>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>République populaire de Chine</country>
<placeName><settlement type="city">Pékin</settlement>
</placeName>
</affiliation>
</author>
<author><name sortKey="Naoi, Satoshi" sort="Naoi, Satoshi" uniqKey="Naoi S" first="Satoshi" last="Naoi">Satoshi Naoi</name>
<affiliation wicri:level="1"><inist:fA14 i1="04"><s1>Fujitsu Laboratories Ltd</s1>
<s2>Kawasaki</s2>
<s3>JPN</s3>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>Japon</country>
<wicri:noRegion>Fujitsu Laboratories Ltd</wicri:noRegion>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Chinese</term>
<term>Knowledge representation</term>
<term>Language analysis</term>
<term>Modeling</term>
<term>Optical character recognition</term>
<term>Semantic analysis</term>
<term>Semantics</term>
<term>Statistical model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Chinois</term>
<term>Reconnaissance optique caractère</term>
<term>Analyse langage</term>
<term>Reconnaissance caractère</term>
<term>Sémantique</term>
<term>Modèle statistique</term>
<term>Modélisation</term>
<term>Analyse sémantique</term>
<term>Représentation connaissance</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper describes an effective spelling check approach for Chinese OCR built on a new multi-knowledge-based statistical language model. The model combines a conventional n-gram language model with a new LSA (Latent Semantic Analysis) language model, so that both local (syntactic) and global (semantic) information are used. In addition, similar Chinese characters are added to the candidate list during the Viterbi search, so that the list is more likely to contain the correct result. With this approach, the best recognition accuracy rises from 79.3% to 91.9%, a 60.9% reduction in errors (the error rate falls from 20.7% to 8.1%).</div>
</front>
</TEI>
<affiliations><list><country><li>Japon</li>
<li>République populaire de Chine</li>
</country>
<settlement><li>Pékin</li>
</settlement>
</list>
<tree><country name="République populaire de Chine"><noRegion><name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
</noRegion>
<name sortKey="Chunheng Wang" sort="Chunheng Wang" uniqKey="Chunheng Wang" last="Chunheng Wang">CHUNHENG WANG</name>
<name sortKey="Li Zhuang" sort="Li Zhuang" uniqKey="Li Zhuang" last="Li Zhuang">LI ZHUANG</name>
<name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<name sortKey="Ta Bao" sort="Ta Bao" uniqKey="Ta Bao" last="Ta Bao">TA BAO</name>
<name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
<name sortKey="Xiaoyan Zhu" sort="Xiaoyan Zhu" uniqKey="Xiaoyan Zhu" last="Xiaoyan Zhu">XIAOYAN ZHU</name>
</country>
<country name="Japon"><noRegion><name sortKey="Naoi, Satoshi" sort="Naoi, Satoshi" uniqKey="Naoi S" first="Satoshi" last="Naoi">Satoshi Naoi</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001732 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 001732 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Merge |type= RBID |clé= Pascal:06-0112413 |texte= A Chinese OCR spelling check approach based on statistical language models }}
This area was generated with Dilib version V0.6.32.