Exploration server for computer science research in Lorraine

Recognition of printed arabic text based on global features and decision tree learning techniques

Internal identifier: 000157 (Istex/Corpus); previous: 000156; next: 000158

Author: Adnan Amin

Source: Pattern Recognition, vol. 33, no. 8 (2000), pp. 1309-1323

RBID : ISTEX:075C2186877896091C7ED5D01E963553ED179886

Abstract

Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic characters, either on-line or off-line. This is a result of the lack of adequate support in terms of funding and of other utilities such as Arabic text databases and dictionaries, and of course of the cursive nature of Arabic writing rules; the problem remains an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantages of machine learning are twofold: it can generalize over the large degree of variation between different fonts and writing styles, and recognition rules can be constructed from examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected components, detect the skew of the document image and correct it. The second is feature extraction, in which global features of the input Arabic word, such as the number of subwords, the number of peaks within each subword, and the number and position of the complementary characters, are extracted in order to avoid a difficult segmentation stage. Finally, C4.5 is used to generate a decision tree for classifying each word. The system was tested on 1000 Arabic words in different fonts (15 samples per word), and the average correct recognition rate obtained using cross-validation was 92%.
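
The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`count_components`, `column_profile`, `count_peaks`, `word_features`) are hypothetical, the flood fill stands in for the scanline-based component building the paper reports, and no histogram smoothing is applied. The sketch assumes a binary word image given as a list of rows, with 1 for black pixels and 0 for white pixels.

```python
from collections import deque


def count_components(img):
    """Count 8-connected groups of black (1) pixels using a breadth-first flood fill."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # New, unvisited black pixel: start a new component.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def column_profile(img):
    """Number of black pixels in each column (a vertical projection)."""
    return [sum(col) for col in zip(*img)]


def count_peaks(profile):
    """Count local maxima of the profile; a flat plateau counts as one peak."""
    peaks = 0
    rising = False
    prev = 0
    for value in profile + [0]:  # trailing 0 closes a peak at the right edge
        if value > prev:
            rising = True
        elif value < prev:
            if rising:
                peaks += 1
            rising = False
        prev = value
    return peaks


def word_features(img):
    """Global features of one word image, as a flat record for a tree learner."""
    return {
        "components": count_components(img),
        "peaks": count_peaks(column_profile(img)),
    }
```

Each word image is reduced to a small fixed-length record, which is the kind of input a decision tree learner such as C4.5 expects. The component count here covers main strokes and dots alike; separating subwords from complementary characters, as the abstract describes, would need additional rules that this sketch does not attempt.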

DOI: 10.1016/S0031-3203(99)00114-4

Links to Exploration step

ISTEX:075C2186877896091C7ED5D01E963553ED179886

The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Recognition of printed arabic text based on global features and decision tree learning techniques</title>
<author>
<name sortKey="Amin, Adnan" sort="Amin, Adnan" uniqKey="Amin A" first="Adnan" last="Amin">Adnan Amin</name>
<affiliation>
<mods:affiliation>School of Computer Science and Engineering, University of New South Wales, 2052 Sydney, Australia</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: amin@cse.unsw.edu.au</mods:affiliation>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:075C2186877896091C7ED5D01E963553ED179886</idno>
<date when="2000" year="2000">2000</date>
<idno type="doi">10.1016/S0031-3203(99)00114-4</idno>
<idno type="url">https://api.istex.fr/ark:/67375/6H6-75HWFV2G-B/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000157</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">000157</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a">Recognition of printed arabic text based on global features and decision tree learning techniques</title>
<author>
<name sortKey="Amin, Adnan" sort="Amin, Adnan" uniqKey="Amin A" first="Adnan" last="Amin">Adnan Amin</name>
<affiliation>
<mods:affiliation>School of Computer Science and Engineering, University of New South Wales, 2052 Sydney, Australia</mods:affiliation>
</affiliation>
<affiliation>
<mods:affiliation>E-mail: amin@cse.unsw.edu.au</mods:affiliation>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="j">Pattern Recognition</title>
<title level="j" type="abbrev">PR</title>
<idno type="ISSN">0031-3203</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="2000">2000</date>
<biblScope unit="volume">33</biblScope>
<biblScope unit="issue">8</biblScope>
<biblScope unit="page" from="1309">1309</biblScope>
<biblScope unit="page" to="1323">1323</biblScope>
</imprint>
<idno type="ISSN">0031-3203</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0031-3203</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="Teeft" xml:lang="en">
<term>Adequate support</term>
<term>Algorithm</term>
<term>Amin</term>
<term>Amin pattern recognition</term>
<term>Arabic</term>
<term>Arabic character recognition</term>
<term>Arabic characters</term>
<term>Arabic letter</term>
<term>Arabic text</term>
<term>Arabic word</term>
<term>Arabic words</term>
<term>Automatic recognition</term>
<term>Baseline</term>
<term>Binary image</term>
<term>Black pixels</term>
<term>Block diagram</term>
<term>Character recognition</term>
<term>Comparaison dynamique</term>
<term>Complementary character</term>
<term>Complementary characters</term>
<term>Complete representation</term>
<term>Computer recognition</term>
<term>Computer science</term>
<term>Cursive</term>
<term>Cursive nature</term>
<term>Decision tree</term>
<term>Document analysis</term>
<term>Elsevier science</term>
<term>Error rate</term>
<term>Error rates performance</term>
<term>Feature extraction</term>
<term>First kuwait computer conference</term>
<term>Global features</term>
<term>Handprinted characters</term>
<term>Handwriting recognition</term>
<term>Handwritten</term>
<term>Handwritten characters</term>
<term>Horizontal projection</term>
<term>Ieee</term>
<term>Ieee trans</term>
<term>Inner contours</term>
<term>Intensive research</term>
<term>International conference</term>
<term>Japanese characters</term>
<term>Large number</term>
<term>Leaf node</term>
<term>Leaf nodes</term>
<term>Line segment</term>
<term>Machine recognition</term>
<term>Markov models</term>
<term>National computer conference</term>
<term>Node</term>
<term>Other utilities</term>
<term>Pattern recognition</term>
<term>Pattern recognition society</term>
<term>Pixel</term>
<term>Previous scanline</term>
<term>Recognition process</term>
<term>Recognition rate</term>
<term>Same font</term>
<term>Same shape</term>
<term>Segmentation</term>
<term>Segmentation stage</term>
<term>Subwords</term>
<term>Successive scanlines</term>
<term>Symbolic machine</term>
<term>Syntactical pattern recognition</term>
<term>Technical papers</term>
<term>Text recognition</term>
<term>Uniform population</term>
<term>Vowel diacritics</term>
<term>White pixels</term>
</keywords>
</textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic in both on-line and off-line, has been achieved towards the automatic recognition of Arabic characters. This is a result of the lack of adequate support in terms of funding, and other utilities such as Arabic text databases, dictionaries, etc., and of course because of the cursive nature of its writing rules, and this problem is still an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantage of machine learning are twofold: it can generalize over the large degree of variations between different fonts and writing style and recognition rules can be constructed by examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected component, detect the skew of a document image and correct it. Second, feature extraction, where global features of the input Arabic word is used to extract features such as number of subwords, number of peaks within the subword, number and position of the complementary character, etc., to avoid the difficulty of segmentation stage. Finally, machine learning C4.5 is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words with different fonts (each word has 15 samples) and the correct average recognition rate obtained using cross-validation was 92%.</div>
</front>
</TEI>
<istex>
<corpusName>elsevier</corpusName>
<keywords>
<teeft>
<json:string>amin</json:string>
<json:string>pixel</json:string>
<json:string>pattern recognition</json:string>
<json:string>amin pattern recognition</json:string>
<json:string>cursive</json:string>
<json:string>handwritten</json:string>
<json:string>subwords</json:string>
<json:string>decision tree</json:string>
<json:string>arabic text</json:string>
<json:string>arabic characters</json:string>
<json:string>algorithm</json:string>
<json:string>segmentation</json:string>
<json:string>ieee</json:string>
<json:string>node</json:string>
<json:string>arabic word</json:string>
<json:string>black pixels</json:string>
<json:string>arabic words</json:string>
<json:string>arabic</json:string>
<json:string>complementary characters</json:string>
<json:string>automatic recognition</json:string>
<json:string>international conference</json:string>
<json:string>character recognition</json:string>
<json:string>baseline</json:string>
<json:string>global features</json:string>
<json:string>recognition rate</json:string>
<json:string>arabic letter</json:string>
<json:string>ieee trans</json:string>
<json:string>computer science</json:string>
<json:string>line segment</json:string>
<json:string>horizontal projection</json:string>
<json:string>machine recognition</json:string>
<json:string>document analysis</json:string>
<json:string>recognition process</json:string>
<json:string>markov models</json:string>
<json:string>symbolic machine</json:string>
<json:string>handwriting recognition</json:string>
<json:string>japanese characters</json:string>
<json:string>complementary character</json:string>
<json:string>syntactical pattern recognition</json:string>
<json:string>vowel diacritics</json:string>
<json:string>segmentation stage</json:string>
<json:string>same shape</json:string>
<json:string>intensive research</json:string>
<json:string>same font</json:string>
<json:string>adequate support</json:string>
<json:string>pattern recognition society</json:string>
<json:string>successive scanlines</json:string>
<json:string>previous scanline</json:string>
<json:string>elsevier science</json:string>
<json:string>binary image</json:string>
<json:string>technical papers</json:string>
<json:string>white pixels</json:string>
<json:string>inner contours</json:string>
<json:string>other utilities</json:string>
<json:string>leaf nodes</json:string>
<json:string>leaf node</json:string>
<json:string>uniform population</json:string>
<json:string>complete representation</json:string>
<json:string>error rates performance</json:string>
<json:string>large number</json:string>
<json:string>handwritten characters</json:string>
<json:string>error rate</json:string>
<json:string>cursive nature</json:string>
<json:string>feature extraction</json:string>
<json:string>handprinted characters</json:string>
<json:string>text recognition</json:string>
<json:string>comparaison dynamique</json:string>
<json:string>first kuwait computer conference</json:string>
<json:string>national computer conference</json:string>
<json:string>computer recognition</json:string>
<json:string>arabic character recognition</json:string>
<json:string>block diagram</json:string>
</teeft>
</keywords>
<author>
<json:item>
<name>Adnan Amin</name>
<affiliations>
<json:string>School of Computer Science and Engineering, University of New South Wales, 2052 Sydney, Australia</json:string>
<json:string>E-mail: amin@cse.unsw.edu.au</json:string>
</affiliations>
</json:item>
</author>
<subject>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Pattern recognition</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Printed Arabic text</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Connected component</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Skew detection and correction</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Global features</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Structural classification</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Machine learning C4.5</value>
</json:item>
<json:item>
<lang>
<json:string>eng</json:string>
</lang>
<value>Cross-validation</value>
</json:item>
</subject>
<arkIstex>ark:/67375/6H6-75HWFV2G-B</arkIstex>
<language>
<json:string>eng</json:string>
</language>
<originalGenre>
<json:string>Full-length article</json:string>
</originalGenre>
<abstract>Abstract: Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic in both on-line and off-line, has been achieved towards the automatic recognition of Arabic characters. This is a result of the lack of adequate support in terms of funding, and other utilities such as Arabic text databases, dictionaries, etc., and of course because of the cursive nature of its writing rules, and this problem is still an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantage of machine learning are twofold: it can generalize over the large degree of variations between different fonts and writing style and recognition rules can be constructed by examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected component, detect the skew of a document image and correct it. Second, feature extraction, where global features of the input Arabic word is used to extract features such as number of subwords, number of peaks within the subword, number and position of the complementary character, etc., to avoid the difficulty of segmentation stage. Finally, machine learning C4.5 is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words with different fonts (each word has 15 samples) and the correct average recognition rate obtained using cross-validation was 92%.</abstract>
<qualityIndicators>
<score>10</score>
<pdfWordCount>6396</pdfWordCount>
<pdfCharCount>36890</pdfCharCount>
<pdfVersion>1.2</pdfVersion>
<pdfPageCount>15</pdfPageCount>
<pdfPageSize>506 x 698 pts</pdfPageSize>
<refBibsNative>true</refBibsNative>
<abstractWordCount>269</abstractWordCount>
<abstractCharCount>1706</abstractCharCount>
<keywordCount>8</keywordCount>
</qualityIndicators>
<title>Recognition of printed arabic text based on global features and decision tree learning techniques</title>
<pii>
<json:string>S0031-3203(99)00114-4</json:string>
</pii>
<genre>
<json:string>research-article</json:string>
</genre>
<host>
<title>Pattern Recognition</title>
<language>
<json:string>unknown</json:string>
</language>
<publicationDate>2000</publicationDate>
<issn>
<json:string>0031-3203</json:string>
</issn>
<pii>
<json:string>S0031-3203(00)X0066-0</json:string>
</pii>
<volume>33</volume>
<issue>8</issue>
<pages>
<first>1309</first>
<last>1323</last>
</pages>
<genre>
<json:string>journal</json:string>
</genre>
</host>
<namedEntities>
<unitex>
<date>
<json:string>2000</json:string>
<json:string>1000</json:string>
</date>
<geogName></geogName>
<orgName>
<json:string>Metrology Organization</json:string>
<json:string>Elsevier Science Ltd.</json:string>
<json:string>ASCII</json:string>
<json:string>ASMO</json:string>
<json:string>American Standard Code for Information Interchange</json:string>
</orgName>
<orgName_funder></orgName_funder>
<orgName_provider></orgName_provider>
<persName>
<json:string>Arabic</json:string>
<json:string>A. Amin</json:string>
<json:string>The</json:string>
</persName>
<placeName></placeName>
<ref_url></ref_url>
<ref_bibl>
<json:string>University of New South Wales, 2052</json:string>
<json:string>[21,22]</json:string>
<json:string>[23]</json:string>
<json:string>[37,38]</json:string>
<json:string>[44]</json:string>
<json:string>[8]</json:string>
<json:string>[43]</json:string>
<json:string>[16]</json:string>
<json:string>[3,4]</json:string>
<json:string>[5]</json:string>
<json:string>[41]</json:string>
<json:string>[1,2]</json:string>
<json:string>[40]</json:string>
</ref_bibl>
<bibl></bibl>
</unitex>
</namedEntities>
<ark>
<json:string>ark:/67375/6H6-75HWFV2G-B</json:string>
</ark>
<categories>
<wos>
<json:string>1 - science</json:string>
<json:string>2 - engineering, electrical &amp; electronic</json:string>
<json:string>2 - computer science, artificial intelligence</json:string>
</wos>
<scienceMetrix>
<json:string>1 - applied sciences</json:string>
<json:string>2 - information &amp; communication technologies</json:string>
<json:string>3 - artificial intelligence &amp; image processing</json:string>
</scienceMetrix>
<scopus>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Artificial Intelligence</json:string>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Computer Vision and Pattern Recognition</json:string>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Signal Processing</json:string>
<json:string>1 - Physical Sciences</json:string>
<json:string>2 - Computer Science</json:string>
<json:string>3 - Software</json:string>
</scopus>
<inist>
<json:string>1 - sciences humaines et sociales</json:string>
</inist>
</categories>
<publicationDate>2000</publicationDate>
<copyrightDate>2000</copyrightDate>
<doi>
<json:string>10.1016/S0031-3203(99)00114-4</json:string>
</doi>
<id>075C2186877896091C7ED5D01E963553ED179886</id>
<score>1</score>
<fulltext>
<json:item>
<extension>pdf</extension>
<original>true</original>
<mimetype>application/pdf</mimetype>
<uri>https://api.istex.fr/ark:/67375/6H6-75HWFV2G-B/fulltext.pdf</uri>
</json:item>
<json:item>
<extension>zip</extension>
<original>false</original>
<mimetype>application/zip</mimetype>
<uri>https://api.istex.fr/ark:/67375/6H6-75HWFV2G-B/bundle.zip</uri>
</json:item>
<istex:fulltextTEI uri="https://api.istex.fr/ark:/67375/6H6-75HWFV2G-B/fulltext.tei">
<teiHeader>
<fileDesc>
<titleStmt>
<title level="a">Recognition of printed arabic text based on global features and decision tree learning techniques</title>
</titleStmt>
<publicationStmt>
<authority>ISTEX</authority>
<publisher scheme="https://scientific-publisher.data.istex.fr">ELSEVIER</publisher>
<availability>
<licence>
<p>©2000 Pattern Recognition Society</p>
</licence>
<p scheme="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-HKKZVM7B-M">elsevier</p>
</availability>
<date>2000</date>
</publicationStmt>
<notesStmt>
<note type="research-article" scheme="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</note>
<note type="journal" scheme="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</note>
<note type="content">Fig. 1: Block diagram of the system.</note>
<note type="content">Fig. 2: Different shapes of the Arabic letter “A'” in (a) beginning, (b) middle, (c) end, (d) isolated.</note>
<note type="content">Fig. 3: Arabic characters differing only with regard to the position and number of associated dots.</note>
<note type="content">Fig. 4: Arabic words with constituent subwords.</note>
<note type="content">Fig. 5: Different styles and fonts for the writing of Arabic text.</note>
<note type="content">Fig. 6: The process of building connected components from image scanlines.</note>
<note type="content">Fig. 7: An example of an Arabic text with connected components.</note>
<note type="content">Fig. 8: Merging criteria of a cc into an existing group.</note>
<note type="content">Fig. 9: (a) The input image, (b) horizontal projection of the input word, (c) result of the algorithm.</note>
<note type="content">Fig. 10: The inner and outer contour of an Arabic word.</note>
<note type="content">Fig. 11: An example of segmentation of the word into peaks: (a) Arabic word; (b), (c) histogram before and after smoothing.</note>
<note type="content">Fig. 12: Segment encoding for the C4.5 machine learning.</note>
<note type="content">Fig. 13: The complete representation of word shown in Fig. 11 for the C4.5 machine learning.</note>
<note type="content">Fig. 14: Samples of Arabic words used in experiments.</note>
<note type="content">Table 1: Comparison of various scripts</note>
<note type="content">Table 2: The basic alphabets of Arabic characters and their shapes at different positions in the word</note>
<note type="content">Table 3: Error rates performance using 10-fold cross validation</note>
</notesStmt>
<sourceDesc>
<biblStruct type="inbook">
<analytic>
<title level="a">Recognition of printed arabic text based on global features and decision tree learning techniques</title>
<author xml:id="author-0000">
<persName>
<forename type="first">Adnan</forename>
<surname>Amin</surname>
</persName>
<email>amin@cse.unsw.edu.au</email>
<affiliation>School of Computer Science and Engineering, University of New South Wales, 2052 Sydney, Australia</affiliation>
</author>
<idno type="istex">075C2186877896091C7ED5D01E963553ED179886</idno>
<idno type="ark">ark:/67375/6H6-75HWFV2G-B</idno>
<idno type="DOI">10.1016/S0031-3203(99)00114-4</idno>
<idno type="PII">S0031-3203(99)00114-4</idno>
</analytic>
<monogr>
<title level="j">Pattern Recognition</title>
<title level="j" type="abbrev">PR</title>
<idno type="pISSN">0031-3203</idno>
<idno type="PII">S0031-3203(00)X0066-0</idno>
<imprint>
<publisher>ELSEVIER</publisher>
<date type="published" when="2000"></date>
<biblScope unit="volume">33</biblScope>
<biblScope unit="issue">8</biblScope>
<biblScope unit="page" from="1309">1309</biblScope>
<biblScope unit="page" to="1323">1323</biblScope>
</imprint>
</monogr>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<creation>
<date>2000</date>
</creation>
<langUsage>
<language ident="en">en</language>
</langUsage>
<abstract xml:lang="en">
<p>Abstract: Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic in both on-line and off-line, has been achieved towards the automatic recognition of Arabic characters. This is a result of the lack of adequate support in terms of funding, and other utilities such as Arabic text databases, dictionaries, etc., and of course because of the cursive nature of its writing rules, and this problem is still an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantage of machine learning are twofold: it can generalize over the large degree of variations between different fonts and writing style and recognition rules can be constructed by examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected component, detect the skew of a document image and correct it. Second, feature extraction, where global features of the input Arabic word is used to extract features such as number of subwords, number of peaks within the subword, number and position of the complementary character, etc., to avoid the difficulty of segmentation stage. Finally, machine learning C4.5 is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words with different fonts (each word has 15 samples) and the correct average recognition rate obtained using cross-validation was 92%.</p>
</abstract>
<textClass>
<keywords scheme="keyword">
<list>
<head>Keywords</head>
<item>
<term>Pattern recognition</term>
</item>
<item>
<term>Printed Arabic text</term>
</item>
<item>
<term>Connected component</term>
</item>
<item>
<term>Skew detection and correction</term>
</item>
<item>
<term>Global features</term>
</item>
<item>
<term>Structural classification</term>
</item>
<item>
<term>Machine learning C4.5</term>
</item>
<item>
<term>Cross-validation</term>
</item>
</list>
</keywords>
</textClass>
</profileDesc>
<revisionDesc>
<change when="2000">Published</change>
</revisionDesc>
</teiHeader>
</istex:fulltextTEI>
<json:item>
<extension>txt</extension>
<original>false</original>
<mimetype>text/plain</mimetype>
<uri>https://api.istex.fr/ark:/67375/6H6-75HWFV2G-B/fulltext.txt</uri>
</json:item>
</fulltext>
<metadata>
<istex:metadataXml wicri:clean="Elsevier, elements deleted: ce:floats; body; tail">
<istex:xmlDeclaration>version="1.0" encoding="utf-8"</istex:xmlDeclaration>
<istex:docType PUBLIC="-//ES//DTD journal article DTD version 4.5.2//EN//XML" URI="art452.dtd" name="istex:docType">
<istex:entity SYSTEM="gr1" NDATA="IMAGE" name="gr1"></istex:entity>
<istex:entity SYSTEM="fx1" NDATA="IMAGE" name="fx1"></istex:entity>
<istex:entity SYSTEM="fx2" NDATA="IMAGE" name="fx2"></istex:entity>
<istex:entity SYSTEM="gr2" NDATA="IMAGE" name="gr2"></istex:entity>
<istex:entity SYSTEM="fx3" NDATA="IMAGE" name="fx3"></istex:entity>
<istex:entity SYSTEM="gr3" NDATA="IMAGE" name="gr3"></istex:entity>
<istex:entity SYSTEM="gr4" NDATA="IMAGE" name="gr4"></istex:entity>
<istex:entity SYSTEM="gr5" NDATA="IMAGE" name="gr5"></istex:entity>
<istex:entity SYSTEM="gr6" NDATA="IMAGE" name="gr6"></istex:entity>
<istex:entity SYSTEM="gr7" NDATA="IMAGE" name="gr7"></istex:entity>
<istex:entity SYSTEM="gr8" NDATA="IMAGE" name="gr8"></istex:entity>
<istex:entity SYSTEM="gr9" NDATA="IMAGE" name="gr9"></istex:entity>
<istex:entity SYSTEM="gr10" NDATA="IMAGE" name="gr10"></istex:entity>
<istex:entity SYSTEM="gr11" NDATA="IMAGE" name="gr11"></istex:entity>
<istex:entity SYSTEM="gr12" NDATA="IMAGE" name="gr12"></istex:entity>
<istex:entity SYSTEM="gr13" NDATA="IMAGE" name="gr13"></istex:entity>
<istex:entity SYSTEM="gr14" NDATA="IMAGE" name="gr14"></istex:entity>
</istex:docType>
<istex:document>
<converted-article version="4.5.2" docsubtype="fla">
<item-info>
<jid>PR</jid>
<aid>1118</aid>
<ce:pii>S0031-3203(99)00114-4</ce:pii>
<ce:doi>10.1016/S0031-3203(99)00114-4</ce:doi>
<ce:copyright type="society" year="2000">Pattern Recognition Society</ce:copyright>
</item-info>
<head>
<ce:title>Recognition of printed arabic text based on global features and decision tree learning techniques</ce:title>
<ce:author-group>
<ce:author>
<ce:given-name>Adnan</ce:given-name>
<ce:surname>Amin</ce:surname>
<ce:cross-ref refid="AUT2"></ce:cross-ref>
<ce:e-address>amin@cse.unsw.edu.au</ce:e-address>
</ce:author>
<ce:affiliation>
<ce:textfn>School of Computer Science and Engineering, University of New South Wales, 2052 Sydney, Australia</ce:textfn>
</ce:affiliation>
</ce:author-group>
<ce:date-received day="29" month="4" year="1998"></ce:date-received>
<ce:date-accepted day="4" month="5" year="1999"></ce:date-accepted>
<ce:abstract>
<ce:section-title>Abstract</ce:section-title>
<ce:abstract-sec>
<ce:simple-para>Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic in both on-line and off-line, has been achieved towards the automatic recognition of Arabic characters. This is a result of the lack of adequate support in terms of funding, and other utilities such as Arabic text databases, dictionaries, etc., and of course because of the cursive nature of its writing rules, and this problem is still an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantage of machine learning are twofold: it can generalize over the large degree of variations between different fonts and writing style and recognition rules can be constructed by examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected component, detect the skew of a document image and correct it. Second, feature extraction, where global features of the input Arabic word is used to extract features such as number of subwords, number of peaks within the subword, number and position of the complementary character, etc., to avoid the difficulty of segmentation stage. Finally, machine learning C4.5 is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words with different fonts (each word has 15 samples) and the correct average recognition rate obtained using cross-validation was 92%.</ce:simple-para>
</ce:abstract-sec>
</ce:abstract>
<ce:keywords class="keyword">
<ce:section-title>Keywords</ce:section-title>
<ce:keyword>
<ce:text>Pattern recognition</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Printed Arabic text</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Connected component</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Skew detection and correction</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Global features</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Structural classification</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Machine learning C4.5</ce:text>
</ce:keyword>
<ce:keyword>
<ce:text>Cross-validation</ce:text>
</ce:keyword>
</ce:keywords>
</head>
</converted-article>
</istex:document>
</istex:metadataXml>
<mods version="3.6">
<titleInfo>
<title>Recognition of printed arabic text based on global features and decision tree learning techniques</title>
</titleInfo>
<titleInfo type="alternative" contentType="CDATA">
<title>Recognition of printed arabic text based on global features and decision tree learning techniques</title>
</titleInfo>
<name type="personal">
<namePart type="given">Adnan</namePart>
<namePart type="family">Amin</namePart>
<affiliation>School of Computer Science and Engineering, University of New South Wales, 2052 Sydney, Australia</affiliation>
<affiliation>E-mail: amin@cse.unsw.edu.au</affiliation>
<role>
<roleTerm type="text">author</roleTerm>
</role>
</name>
<typeOfResource>text</typeOfResource>
<genre type="research-article" displayLabel="Full-length article" authority="ISTEX" authorityURI="https://content-type.data.istex.fr" valueURI="https://content-type.data.istex.fr/ark:/67375/XTP-1JC4F85T-7">research-article</genre>
<originInfo>
<publisher>ELSEVIER</publisher>
<dateIssued encoding="w3cdtf">2000</dateIssued>
<copyrightDate encoding="w3cdtf">2000</copyrightDate>
</originInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
<languageTerm type="code" authority="rfc3066">en</languageTerm>
</language>
<abstract lang="en">Abstract: Machine simulation of human reading has been the subject of intensive research for almost three decades. A large number of research papers and reports have already been published on Latin, Chinese and Japanese characters. However, little work has been conducted on the automatic recognition of Arabic in both on-line and off-line, has been achieved towards the automatic recognition of Arabic characters. This is a result of the lack of adequate support in terms of funding, and other utilities such as Arabic text databases, dictionaries, etc., and of course because of the cursive nature of its writing rules, and this problem is still an open research field. This paper presents a new technique for the recognition of Arabic text using the C4.5 machine learning system. The advantage of machine learning are twofold: it can generalize over the large degree of variations between different fonts and writing style and recognition rules can be constructed by examples. The technique can be divided into three major steps. The first step is digitization and pre-processing to create connected component, detect the skew of a document image and correct it. Second, feature extraction, where global features of the input Arabic word is used to extract features such as number of subwords, number of peaks within the subword, number and position of the complementary character, etc., to avoid the difficulty of segmentation stage. Finally, machine learning C4.5 is used to generate a decision tree for classifying each word. The system was tested with 1000 Arabic words with different fonts (each word has 15 samples) and the correct average recognition rate obtained using cross-validation was 92%.</abstract>
<note type="content">Fig. 1: Block diagram of the system.</note>
<note type="content">Fig. 2: Different shapes of the Arabic letter “A' ” in (a) beginning, (b) middle, (c) end, (d) isolated.</note>
<note type="content">Fig. 3: Arabic characters differing only with regard to the position and number of associated dots.</note>
<note type="content">Fig. 4: Arabic words with constituent subwords.</note>
<note type="content">Fig. 5: Different styles and fonts for the writing of Arabic text.</note>
<note type="content">Fig. 6: The process of building connected components from image scanlines.</note>
<note type="content">Fig. 7: An example of an Arabic text with connected components.</note>
<note type="content">Fig. 8: Merging criteria of a cc into an existing group.</note>
<note type="content">Fig. 9: (a) The input image, (b) horizontal projection of the input word, (c) result of the algorithm.</note>
<note type="content">Fig. 10: The inner and outer contour of an Arabic word.</note>
<note type="content">Fig. 11: An example of segmentation of the word into peaks: (a) Arabic word, (b), (c) histogram before and after smoothing.</note>
<note type="content">Fig. 12: Segment encoding for the C4.5 machine learning.</note>
<note type="content">Fig. 13: The complete representation of word shown in Fig. 11 for the C4.5 machine learning.</note>
<note type="content">Fig. 14: Samples of Arabic words used in experiments.</note>
<note type="content">Table 1: Comparison of various scripts</note>
<note type="content">Table 2: The basic alphabets of Arabic characters and their shapes at different positions in the word</note>
<note type="content">Table 3: Error rates performance using 10-fold cross validation</note>
<subject>
<genre>Keywords</genre>
<topic>Pattern recognition</topic>
<topic>Printed Arabic text</topic>
<topic>Connected component</topic>
<topic>Skew detection and correction</topic>
<topic>Global features</topic>
<topic>Structural classification</topic>
<topic>Machine learning C4.5</topic>
<topic>Cross-validation</topic>
</subject>
<relatedItem type="host">
<titleInfo>
<title>Pattern Recognition</title>
</titleInfo>
<titleInfo type="abbreviated">
<title>PR</title>
</titleInfo>
<genre type="journal" authority="ISTEX" authorityURI="https://publication-type.data.istex.fr" valueURI="https://publication-type.data.istex.fr/ark:/67375/JMC-0GLKJH51-B">journal</genre>
<originInfo>
<publisher>ELSEVIER</publisher>
<dateIssued encoding="w3cdtf">2000</dateIssued>
</originInfo>
<identifier type="ISSN">0031-3203</identifier>
<identifier type="PII">S0031-3203(00)X0066-0</identifier>
<part>
<date>2000</date>
<detail type="volume">
<number>33</number>
<caption>vol.</caption>
</detail>
<detail type="issue">
<number>8</number>
<caption>no.</caption>
</detail>
<extent unit="issue-pages">
<start>1263</start>
<end>1404</end>
</extent>
<extent unit="pages">
<start>1309</start>
<end>1323</end>
</extent>
</part>
</relatedItem>
<identifier type="istex">075C2186877896091C7ED5D01E963553ED179886</identifier>
<identifier type="ark">ark:/67375/6H6-75HWFV2G-B</identifier>
<identifier type="DOI">10.1016/S0031-3203(99)00114-4</identifier>
<identifier type="PII">S0031-3203(99)00114-4</identifier>
<accessCondition type="use and reproduction" contentType="copyright">©2000 Pattern Recognition Society</accessCondition>
<recordInfo>
<recordContentSource authority="ISTEX" authorityURI="https://loaded-corpus.data.istex.fr" valueURI="https://loaded-corpus.data.istex.fr/ark:/67375/XBH-HKKZVM7B-M">elsevier</recordContentSource>
<recordOrigin>Pattern Recognition Society, ©2000</recordOrigin>
</recordInfo>
</mods>
<json:item>
<extension>json</extension>
<original>false</original>
<mimetype>application/json</mimetype>
<uri>https://api.istex.fr/ark:/67375/6H6-75HWFV2G-B/record.json</uri>
</json:item>
</metadata>
<serie></serie>
</istex>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Lorraine/explor/InforLorV4/Data/Istex/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000157 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Istex/Corpus/biblio.hfd -nk 000157 | SxmlIndent | more
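
Once the record has been printed with one of the commands above, its bibliographic fields can be read with any XML parser. The following is a minimal sketch in Python, not part of Dilib: it assumes the host-journal element is available as a standalone, un-namespaced fragment (copied here from the record above), and the function name `citation_fields` is hypothetical. A full MODS export may declare a namespace, in which case the paths would need a namespace prefix.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the <relatedItem type="host"> element of the record.
# Assumption: no XML namespace is declared on these elements.
FRAGMENT = """
<relatedItem type="host">
  <titleInfo><title>Pattern Recognition</title></titleInfo>
  <part>
    <date>2000</date>
    <detail type="volume"><number>33</number><caption>vol.</caption></detail>
    <detail type="issue"><number>8</number><caption>no.</caption></detail>
    <extent unit="issue-pages"><start>1263</start><end>1404</end></extent>
    <extent unit="pages"><start>1309</start><end>1323</end></extent>
  </part>
</relatedItem>
"""


def citation_fields(xml_text):
    """Return journal, year, volume, issue and article pages from a host item."""
    host = ET.fromstring(xml_text)
    part = host.find("part")
    # Two <extent> elements exist; select the article's own page range.
    pages = part.find("extent[@unit='pages']")
    return {
        "journal": host.findtext("titleInfo/title"),
        "year": part.findtext("date"),
        "volume": part.findtext("detail[@type='volume']/number"),
        "issue": part.findtext("detail[@type='issue']/number"),
        "pages": (int(pages.findtext("start")), int(pages.findtext("end"))),
    }


fields = citation_fields(FRAGMENT)
print(fields)
```

The `unit` attribute is what distinguishes the article's pages (1309 to 1323) from the page span of the whole issue (1263 to 1404), so the sketch selects on it explicitly rather than taking the first `extent`.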

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Lorraine
   |area=    InforLorV4
   |flux=    Istex
   |étape=   Corpus
   |type=    RBID
   |clé=     ISTEX:075C2186877896091C7ED5D01E963553ED179886
   |texte=   Recognition of printed arabic text based on global features and decision tree learning techniques
}}

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jun 10 21:56:28 2019. Site generation: Fri Feb 25 15:29:27 2022