Word–Wise Script Identification from Indian Documents
Internal identifier: 000491 (Istex/Curation); previous: 000490; next: 000492
Authors: Suranjit Sinha [India]; Umapada Pal [India]; B. Chaudhuri [India]
Source:
- Lecture Notes in Computer Science [0302-9743]; 2004.
Abstract
In a country like India, a single text line in most official documents contains words in two different scripts. Under the two-language formula, Indian documents are written in English and the official state language. For Optical Character Recognition (OCR) of such a document page, words of different scripts must be separated before they are fed to the OCR systems of the individual scripts. This paper proposes a robust technique for word-wise script identification in Indian doublet-form documents. The document is first segmented into lines, and the lines are then segmented into words. Individual script words are then identified using different topological and structural features (such as the number of loops, the headline feature, water-reservoir-concept-based features, and profile features). The proposed scheme was tested on 24210 words from different doublets and achieved more than 97% accuracy on average.
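The abstract states that the page is segmented into lines and the lines into words, but it does not say how. The following is a minimal sketch of one common approach, projection-profile segmentation on a binary image, and is an assumption rather than the authors' method. The names `runs`, `segment_lines`, `segment_words` and the `min_gap` parameter are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) pairs of non-zero runs in a projection profile.

    Runs separated by fewer than `min_gap` zeros are merged into one.
    `end` is exclusive.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ends at a blank position
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)  # gap too small: same unit
        else:
            merged.append((s, e))
    return merged


def segment_lines(img: Image) -> List[Tuple[int, int]]:
    """Text lines are maximal runs of rows that contain ink."""
    return runs([sum(row) for row in img])


def segment_words(img: Image, line: Tuple[int, int],
                  min_gap: int = 2) -> List[Tuple[int, int]]:
    """Words are runs of inked columns inside one line.

    Gaps narrower than `min_gap` columns are assumed to be
    inter-character spacing, not word boundaries.
    """
    top, bottom = line
    width = len(img[0]) if img else 0
    profile = [sum(img[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(profile, min_gap)
```

Each word span returned this way could then be passed to a feature extractor and script classifier. On real scans, `min_gap` would need tuning against the character spacing of the scripts involved.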
Url: https://api.istex.fr/document/DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E/fulltext/pdf
DOI: 10.1007/978-3-540-28640-0_29
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000498
Links to Exploration step
ISTEX:DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Word–Wise Script Identification from Indian Documents</title>
<author><name sortKey="Sinha, Suranjit" sort="Sinha, Suranjit" uniqKey="Sinha S" first="Suranjit" last="Sinha">Suranjit Sinha</name>
<affiliation wicri:level="1"><mods:affiliation>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata, India</mods:affiliation>
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Pal, Umapada" sort="Pal, Umapada" uniqKey="Pal U" first="Umapada" last="Pal">Umapada Pal</name>
<affiliation wicri:level="1"><mods:affiliation>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata, India</mods:affiliation>
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: umapada@isical.ac.in</mods:affiliation>
<country wicri:rule="url">Inde</country>
</affiliation>
</author>
<author><name sortKey="Chaudhuri, B" sort="Chaudhuri, B" uniqKey="Chaudhuri B" first="B." last="Chaudhuri">B. Chaudhuri</name>
<affiliation wicri:level="1"><mods:affiliation>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata, India</mods:affiliation>
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E</idno>
<date when="2004" year="2004">2004</date>
<idno type="doi">10.1007/978-3-540-28640-0_29</idno>
<idno type="url">https://api.istex.fr/document/DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000498</idno>
<idno type="wicri:Area/Istex/Curation">000491</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Word–Wise Script Identification from Indian Documents</title>
<author><name sortKey="Sinha, Suranjit" sort="Sinha, Suranjit" uniqKey="Sinha S" first="Suranjit" last="Sinha">Suranjit Sinha</name>
<affiliation wicri:level="1"><mods:affiliation>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata, India</mods:affiliation>
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
</affiliation>
</author>
<author><name sortKey="Pal, Umapada" sort="Pal, Umapada" uniqKey="Pal U" first="Umapada" last="Pal">Umapada Pal</name>
<affiliation wicri:level="1"><mods:affiliation>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata, India</mods:affiliation>
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1"><mods:affiliation>E-mail: umapada@isical.ac.in</mods:affiliation>
<country wicri:rule="url">Inde</country>
</affiliation>
</author>
<author><name sortKey="Chaudhuri, B" sort="Chaudhuri, B" uniqKey="Chaudhuri B" first="B." last="Chaudhuri">B. Chaudhuri</name>
<affiliation wicri:level="1"><mods:affiliation>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata, India</mods:affiliation>
<country xml:lang="fr">Inde</country>
<wicri:regionArea>Computer Vision and Pattern Recognition Unit, Indian Statistical Unit, 203 B.T. Road, 700 108, Kolkata</wicri:regionArea>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2004</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E</idno>
<idno type="DOI">10.1007/978-3-540-28640-0_29</idno>
<idno type="ChapterID">29</idno>
<idno type="ChapterID">Chap29</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: In a country like India, a single text line of most of the official documents contains two different script words. Under two-language formula, the Indian documents are written in English and the state official language. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate different script words before feeding them to the OCRs of individual scripts. In this paper a robust technique is proposed to extract word-wise script identification from Indian doublet form documents. Here, at first, the document is segmented into lines and then the lines are segmented into words. Using different topological and structural features (like number of loops, headline feature, water reservoir concept based features, profile features, etc.) individual script words are identified from the documents. The proposed scheme is tested on 24210 words of different doublets and we received more than 97% accuracy, on average.</div>
</front>
</TEI>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Istex/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000491 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Istex/Curation/biblio.hfd -nk 000491 | SxmlIndent | more
To add a link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Istex |étape= Curation |type= RBID |clé= ISTEX:DF8BFCAE28D0DD31D95FD2F67000772E8B8DB97E |texte= Word–Wise Script Identification from Indian Documents }}
This area was generated with Dilib version V0.6.32.