OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Multilingual machine printed OCR

Internal identifier: 001C94 (Main/Merge); previous: 001C93; next: 001C95

Authors: Premkumar Natarajan [United States]; Zhidong Lu [United States]; Richard Schwartz [United States]; Issam Bazzi [United States]; John Makhoul [United States]

Source: International journal of pattern recognition and artificial intelligence, 2001 (ISSN 0218-0014)

RBID : Pascal:01-0202799

French descriptors

English descriptors

Abstract

This paper presents a script-independent methodology for optical character recognition (OCR) based on the use of hidden Markov models (HMM). The feature extraction, training and recognition components of the system are all designed to be script independent. The training and recognition components were taken without modification from a continuous speech recognition system; the only component that is specific to OCR is the feature extraction component. To port the system to a new language, all that is needed is text image training data from the new language, along with ground truth which gives the identity of the sequences of characters along each line of each text image, without specifying the location of the characters on the image. The parameters of the character HMMs are estimated automatically from the training data, without the need for laborious handwritten rules. The system does not require presegmentation of the data, neither at the word level nor at the character level. Thus, the system is able to handle languages with connected characters in a straightforward manner. The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese. The robustness of the system is further demonstrated by testing the system on fax data. An unsupervised adaptation method is then described to improve performance under degraded conditions.
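To illustrate why an HMM recognizer needs no presegmentation, here is a minimal Python sketch of Viterbi decoding over a line of frames. It is not the authors' system: the two toy character models, the discrete symbols "x", "y", "z" (stand-ins for feature vectors of narrow vertical strips of the line image), the STAY probability, and the name decode_line are all assumptions made for illustration. The point is only that character identities and boundaries fall out of a single search over the whole line.

```python
import math

# Hypothetical toy models: each character is a left-to-right HMM with two
# states, and each state emits one discrete symbol per frame.
CHAR_MODELS = {
    "a": [{"x": 0.9, "y": 0.05, "z": 0.05}, {"x": 0.05, "y": 0.9, "z": 0.05}],
    "b": [{"x": 0.05, "y": 0.05, "z": 0.9}, {"x": 0.9, "y": 0.05, "z": 0.05}],
}
STAY = 0.5  # assumed self-loop probability; 1 - STAY leaves the state
NEG_INF = float("-inf")


def emit(state, symbol):
    """Log-probability that `state` emits `symbol`."""
    char, idx = state
    return math.log(CHAR_MODELS[char][idx][symbol])


def predecessors(state):
    """Yield (previous_state, log transition probability) pairs."""
    char, idx = state
    yield state, math.log(STAY)  # self-loop
    if idx > 0:
        # Advance within the same character model.
        yield (char, idx - 1), math.log(1 - STAY)
    else:
        # Enter a character from the last state of any character.
        for prev_char, model in CHAR_MODELS.items():
            yield (prev_char, len(model) - 1), math.log((1 - STAY) / len(CHAR_MODELS))


def decode_line(frames):
    """Return the most likely character string for a sequence of frames."""
    if not frames:
        return ""
    states = [(c, i) for c, m in CHAR_MODELS.items() for i in range(len(m))]
    # A line must start in the first state of some character.
    score = {
        s: math.log(1 / len(CHAR_MODELS)) + emit(s, frames[0]) if s[1] == 0 else NEG_INF
        for s in states
    }
    backpointers = []
    for symbol in frames[1:]:
        new_score, pointer = {}, {}
        for s in states:
            prev, best = max(
                ((p, score[p] + lp) for p, lp in predecessors(s)),
                key=lambda t: t[1],
            )
            new_score[s] = best + emit(s, symbol)
            pointer[s] = prev
        score = new_score
        backpointers.append(pointer)
    # A line must end in the last state of some character.
    finals = [s for s in states if s[1] == len(CHAR_MODELS[s[0]]) - 1]
    state = max(finals, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    # A character starts whenever its first state is entered from elsewhere.
    text, prev = [], None
    for s in path:
        if s[1] == 0 and s != prev:
            text.append(s[0])
        prev = s
    return "".join(text)


print(decode_line("xxyyzzxx"))  # ab
```

The frames are never cut into characters beforehand: the boundary between "a" and "b" is simply the point where the best path leaves one model and enters the next.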


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Multilingual machine printed OCR</title>
<author>
<name sortKey="Natarajan, Premkumar" sort="Natarajan, Premkumar" uniqKey="Natarajan P" first="Premkumar" last="Natarajan">Premkumar Natarajan</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Zhidong Lu" sort="Zhidong Lu" uniqKey="Zhidong Lu" last="Zhidong Lu">ZHIDONG LU</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Schwartz, Richard" sort="Schwartz, Richard" uniqKey="Schwartz R" first="Richard" last="Schwartz">Richard Schwartz</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Bazzi, Issam" sort="Bazzi, Issam" uniqKey="Bazzi I" first="Issam" last="Bazzi">Issam Bazzi</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Makhoul, John" sort="Makhoul, John" uniqKey="Makhoul J" first="John" last="Makhoul">John Makhoul</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">01-0202799</idno>
<date when="2001">2001</date>
<idno type="stanalyst">PASCAL 01-0202799 INIST</idno>
<idno type="RBID">Pascal:01-0202799</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000727</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000066</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000667</idno>
<idno type="wicri:doubleKey">0218-0014:2001:Natarajan P:multilingual:machine:printed</idno>
<idno type="wicri:Area/Main/Merge">001C94</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Multilingual machine printed OCR</title>
<author>
<name sortKey="Natarajan, Premkumar" sort="Natarajan, Premkumar" uniqKey="Natarajan P" first="Premkumar" last="Natarajan">Premkumar Natarajan</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Zhidong Lu" sort="Zhidong Lu" uniqKey="Zhidong Lu" last="Zhidong Lu">ZHIDONG LU</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Schwartz, Richard" sort="Schwartz, Richard" uniqKey="Schwartz R" first="Richard" last="Schwartz">Richard Schwartz</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Bazzi, Issam" sort="Bazzi, Issam" uniqKey="Bazzi I" first="Issam" last="Bazzi">Issam Bazzi</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Makhoul, John" sort="Makhoul, John" uniqKey="Makhoul J" first="John" last="Makhoul">John Makhoul</name>
<affiliation wicri:level="2">
<inist:fA14 i1="01">
<s1>BBN Technologies, Verizon</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
<sZ>5 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal of pattern recognition and artificial intelligence</title>
<title level="j" type="abbreviated">Int. j. pattern recogn. artif. intell.</title>
<idno type="ISSN">0218-0014</idno>
<imprint>
<date when="2001">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal of pattern recognition and artificial intelligence</title>
<title level="j" type="abbreviated">Int. j. pattern recogn. artif. intell.</title>
<idno type="ISSN">0218-0014</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Continuous system</term>
<term>Facsimile</term>
<term>Image databank</term>
<term>Image segmentation</term>
<term>Localization</term>
<term>Manuscript character</term>
<term>Markov model</term>
<term>Markov process</term>
<term>Multilingualism</term>
<term>Optical character recognition</term>
<term>Pattern extraction</term>
<term>Pattern recognition</term>
<term>Speech recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance parole</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance forme</term>
<term>Localisation</term>
<term>Reconnaissance optique caractère</term>
<term>Système continu</term>
<term>Banque image</term>
<term>Caractère manuscrit</term>
<term>Extraction forme</term>
<term>Modèle Markov</term>
<term>Processus Markov</term>
<term>Télécopie</term>
<term>Segmentation image</term>
<term>Multilinguisme</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Télécopie</term>
<term>Multilinguisme</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper presents a script-independent methodology for optical character recognition (OCR) based on the use of hidden Markov models (HMM). The feature extraction, training and recognition components of the system are all designed to be script independent. The training and recognition components were taken without modification from a continuous speech recognition system; the only component that is specific to OCR is the feature extraction component. To port the system to a new language, all that is needed is text image training data from the new language, along with ground truth which gives the identity of the sequences of characters along each line of each text image, without specifying the location of the characters on the image. The parameters of the character HMMs are estimated automatically from the training data, without the need for laborious handwritten rules. The system does not require presegmentation of the data, neither at the word level nor at the character level. Thus, the system is able to handle languages with connected characters in a straightforward manner. The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese. The robustness of the system is further demonstrated by testing the system on fax data. An unsupervised adaptation method is then described to improve performance under degraded conditions.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Massachusetts</li>
</region>
</list>
<tree>
<country name="États-Unis">
<region name="Massachusetts">
<name sortKey="Natarajan, Premkumar" sort="Natarajan, Premkumar" uniqKey="Natarajan P" first="Premkumar" last="Natarajan">Premkumar Natarajan</name>
</region>
<name sortKey="Bazzi, Issam" sort="Bazzi, Issam" uniqKey="Bazzi I" first="Issam" last="Bazzi">Issam Bazzi</name>
<name sortKey="Makhoul, John" sort="Makhoul, John" uniqKey="Makhoul J" first="John" last="Makhoul">John Makhoul</name>
<name sortKey="Schwartz, Richard" sort="Schwartz, Richard" uniqKey="Schwartz R" first="Richard" last="Schwartz">Richard Schwartz</name>
<name sortKey="Zhidong Lu" sort="Zhidong Lu" uniqKey="Zhidong Lu" last="Zhidong Lu">ZHIDONG LU</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Merge
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 001C94 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Merge/biblio.hfd -nk 001C94 | SxmlIndent | more
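The commands above print the record as XML. As a hedged illustration of what a downstream script might do with that output, the Python sketch below extracts the title, authors, and year using only the standard library. The record as shown does not declare the wicri: and inist: prefixes, which a strict parser rejects, so the sketch injects placeholder namespace URIs; those URIs, the function names, and the shortened SAMPLE record are assumptions, not part of Dilib.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs (assumptions): the record does not declare these
# prefixes, so they are bound here only to satisfy the XML parser.
ASSUMED_NS = {"wicri": "urn:example:wicri", "inist": "urn:example:inist"}


def parse_record(xml_text):
    """Parse a <record> after binding the undeclared prefixes on its root."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in ASSUMED_NS.items())
    patched = xml_text.replace("<record>", f"<record {decls}>", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Return title, author names and year from the analytic description."""
    analytic = record.find(".//analytic")
    return {
        "title": analytic.findtext("title"),
        "authors": [name.text for name in analytic.findall("author/name")],
        "year": record.findtext(".//publicationStmt/date"),
    }


# Shortened stand-in for the record printed by the commands above.
SAMPLE = """<record><TEI><teiHeader><fileDesc>
<publicationStmt><date when="2001">2001</date></publicationStmt>
<sourceDesc><biblStruct><analytic>
<title xml:lang="en" level="a">Multilingual machine printed OCR</title>
<author><name>Premkumar Natarajan</name>
<affiliation wicri:level="2"><country>États-Unis</country></affiliation></author>
<author><name>John Makhoul</name></author>
</analytic></biblStruct></sourceDesc>
</fileDesc></teiHeader></TEI></record>"""

info = summarize(parse_record(SAMPLE))
print(info["title"], info["authors"], info["year"])
```

Reading authors from the analytic element rather than from titleStmt avoids counting each author twice, since the record lists them in both places.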

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Merge
   |type=    RBID
   |clé=     Pascal:01-0202799
   |texte=   Multilingual machine printed OCR
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024