OCR exploration server


Rotation-Invariant Features for Multi-Oriented Text Detection in Natural Images

Internal identifier: 000175 (Pmc/Curation); previous: 000174; next: 000176


Authors: Cong Yao [People's Republic of China]; Xin Zhang [People's Republic of China]; Xiang Bai [People's Republic of China]; Wenyu Liu [People's Republic of China]; Yi Ma [People's Republic of China]; Zhuowen Tu [People's Republic of China, United States]

Source :

RBID : PMC:3734103

Abstract

Texts in natural scenes carry rich semantic information, which can be used to assist a wide range of applications, such as object recognition, image/video retrieval, mapping/navigation, and human computer interaction. However, most existing systems are designed to detect and recognize horizontal (or near-horizontal) texts. Due to the increasing popularity of mobile-computing devices and applications, detecting texts of varying orientations from natural images under less controlled conditions has become an important but challenging task. In this paper, we propose a new algorithm to detect texts of varying orientations. Our algorithm is based on a two-level classification scheme and two sets of features specially designed for capturing the intrinsic characteristics of texts. To better evaluate the proposed method and compare it with the competing algorithms, we generate a comprehensive dataset with various types of texts in diverse real-world scenes. We also propose a new evaluation protocol, which is more suitable for benchmarking algorithms for detecting texts in varying orientations. Experiments on benchmark datasets demonstrate that our system compares favorably with the state-of-the-art algorithms when handling horizontal texts and achieves significantly enhanced performance on variant texts in complex natural scenes.


Url:
DOI: 10.1371/journal.pone.0070173
PubMed: 23940544
PubMed Central: 3734103


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Rotation-Invariant Features for Multi-Oriented Text Detection in Natural Images</title>
<author>
<name sortKey="Yao, Cong" sort="Yao, Cong" uniqKey="Yao C" first="Cong" last="Yao">Cong Yao</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Xin" sort="Zhang, Xin" uniqKey="Zhang X" first="Xin" last="Zhang">Xin Zhang</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<addr-line>Department of Computer Science and Technology, Tsinghua University, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Computer Science and Technology, Tsinghua University, Beijing</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Bai, Xiang" sort="Bai, Xiang" uniqKey="Bai X" first="Xiang" last="Bai">Xiang Bai</name>
<affiliation wicri:level="1">
<nlm:aff id="aff3">
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Liu, Wenyu" sort="Liu, Wenyu" uniqKey="Liu W" first="Wenyu" last="Liu">Wenyu Liu</name>
<affiliation wicri:level="1">
<nlm:aff id="aff4">
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Ma, Yi" sort="Ma, Yi" uniqKey="Ma Y" first="Yi" last="Ma">Yi Ma</name>
<affiliation wicri:level="1">
<nlm:aff id="aff5">
<addr-line>Microsoft Research Asia, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Microsoft Research Asia, Beijing</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Tu, Zhuowen" sort="Tu, Zhuowen" uniqKey="Tu Z" first="Zhuowen" last="Tu">Zhuowen Tu</name>
<affiliation wicri:level="1">
<nlm:aff id="aff5">
<addr-line>Microsoft Research Asia, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Microsoft Research Asia, Beijing</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="aff6">
<addr-line>University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PMC</idno>
<idno type="pmid">23940544</idno>
<idno type="pmc">3734103</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3734103</idno>
<idno type="RBID">PMC:3734103</idno>
<idno type="doi">10.1371/journal.pone.0070173</idno>
<date when="2013">2013</date>
<idno type="wicri:Area/Pmc/Corpus">000175</idno>
<idno type="wicri:Area/Pmc/Curation">000175</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a" type="main">Rotation-Invariant Features for Multi-Oriented Text Detection in Natural Images</title>
<author>
<name sortKey="Yao, Cong" sort="Yao, Cong" uniqKey="Yao C" first="Cong" last="Yao">Cong Yao</name>
<affiliation wicri:level="1">
<nlm:aff id="aff1">
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Zhang, Xin" sort="Zhang, Xin" uniqKey="Zhang X" first="Xin" last="Zhang">Xin Zhang</name>
<affiliation wicri:level="1">
<nlm:aff id="aff2">
<addr-line>Department of Computer Science and Technology, Tsinghua University, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Computer Science and Technology, Tsinghua University, Beijing</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Bai, Xiang" sort="Bai, Xiang" uniqKey="Bai X" first="Xiang" last="Bai">Xiang Bai</name>
<affiliation wicri:level="1">
<nlm:aff id="aff3">
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Liu, Wenyu" sort="Liu, Wenyu" uniqKey="Liu W" first="Wenyu" last="Liu">Wenyu Liu</name>
<affiliation wicri:level="1">
<nlm:aff id="aff4">
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Ma, Yi" sort="Ma, Yi" uniqKey="Ma Y" first="Yi" last="Ma">Yi Ma</name>
<affiliation wicri:level="1">
<nlm:aff id="aff5">
<addr-line>Microsoft Research Asia, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Microsoft Research Asia, Beijing</wicri:regionArea>
</affiliation>
</author>
<author>
<name sortKey="Tu, Zhuowen" sort="Tu, Zhuowen" uniqKey="Tu Z" first="Zhuowen" last="Tu">Zhuowen Tu</name>
<affiliation wicri:level="1">
<nlm:aff id="aff5">
<addr-line>Microsoft Research Asia, Beijing, China</addr-line>
</nlm:aff>
<country xml:lang="fr">République populaire de Chine</country>
<wicri:regionArea>Microsoft Research Asia, Beijing</wicri:regionArea>
</affiliation>
<affiliation wicri:level="1">
<nlm:aff id="aff6">
<addr-line>University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</nlm:aff>
<country xml:lang="fr">États-Unis</country>
<wicri:regionArea>University of California Los Angeles, Los Angeles, California</wicri:regionArea>
</affiliation>
</author>
</analytic>
<series>
<title level="j">PLoS ONE</title>
<idno type="eISSN">1932-6203</idno>
<imprint>
<date when="2013">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">
<p>Texts in natural scenes carry rich semantic information, which can be used to assist a wide range of applications, such as object recognition, image/video retrieval, mapping/navigation, and human computer interaction. However, most existing systems are designed to detect and recognize horizontal (or near-horizontal) texts. Due to the increasing popularity of mobile-computing devices and applications, detecting texts of varying orientations from natural images under less controlled conditions has become an important but challenging task. In this paper, we propose a new algorithm to detect texts of varying orientations. Our algorithm is based on a two-level classification scheme and two sets of features specially designed for capturing the intrinsic characteristics of texts. To better evaluate the proposed method and compare it with the competing algorithms, we generate a comprehensive dataset with various types of texts in diverse real-world scenes. We also propose a new evaluation protocol, which is more suitable for benchmarking algorithms for detecting texts in varying orientations. Experiments on benchmark datasets demonstrate that our system compares favorably with the state-of-the-art algorithms when handling horizontal texts and achieves significantly enhanced performance on variant texts in complex natural scenes.</p>
</div>
</front>
<back>
<div1 type="bibliography">
<listBibl>
<biblStruct>
<analytic>
<author>
<name sortKey="Desouza, Gn" uniqKey="Desouza G">GN DeSouza</name>
</author>
<author>
<name sortKey="Kak, Ac" uniqKey="Kak A">AC Kak</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jain, A" uniqKey="Jain A">A Jain</name>
</author>
<author>
<name sortKey="Yu, B" uniqKey="Yu B">B Yu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hasan, Ymy" uniqKey="Hasan Y">YMY Hasan</name>
</author>
<author>
<name sortKey="Karam, Lj" uniqKey="Karam L">LJ Karam</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Kim, Ki" uniqKey="Kim K">KI Kim</name>
</author>
<author>
<name sortKey="Jung, K" uniqKey="Jung K">K Jung</name>
</author>
<author>
<name sortKey="Kim, Jh" uniqKey="Kim J">JH Kim</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Chen, D" uniqKey="Chen D">D Chen</name>
</author>
<author>
<name sortKey="Odobez, Jm" uniqKey="Odobez J">JM Odobez</name>
</author>
<author>
<name sortKey="Bourlard, H" uniqKey="Bourlard H">H Bourlard</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, M" uniqKey="Zhao M">M Zhao</name>
</author>
<author>
<name sortKey="Li, St" uniqKey="Li S">ST Li</name>
</author>
<author>
<name sortKey="Kwok, J" uniqKey="Kwok J">J Kwok</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Pan, Y" uniqKey="Pan Y">Y Pan</name>
</author>
<author>
<name sortKey="Hou, X" uniqKey="Hou X">X Hou</name>
</author>
<author>
<name sortKey="Liu, C" uniqKey="Liu C">C Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yi, C" uniqKey="Yi C">C Yi</name>
</author>
<author>
<name sortKey="Tian, Y" uniqKey="Tian Y">Y Tian</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Shivakumara, P" uniqKey="Shivakumara P">P Shivakumara</name>
</author>
<author>
<name sortKey="Phan, Tq" uniqKey="Phan T">TQ Phan</name>
</author>
<author>
<name sortKey="Tan, Cl" uniqKey="Tan C">CL Tan</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Bouman, Kl" uniqKey="Bouman K">KL Bouman</name>
</author>
<author>
<name sortKey="Abdollahian, G" uniqKey="Abdollahian G">G Abdollahian</name>
</author>
<author>
<name sortKey="Boutin, M" uniqKey="Boutin M">M Boutin</name>
</author>
<author>
<name sortKey="Delp, Ej" uniqKey="Delp E">EJ Delp</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhao, X" uniqKey="Zhao X">X Zhao</name>
</author>
<author>
<name sortKey="Lin, Kh" uniqKey="Lin K">KH Lin</name>
</author>
<author>
<name sortKey="Fu, Y" uniqKey="Fu Y">Y Fu</name>
</author>
<author>
<name sortKey="Hu, Y" uniqKey="Hu Y">Y Hu</name>
</author>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Yi, C" uniqKey="Yi C">C Yi</name>
</author>
<author>
<name sortKey="Tian, Y" uniqKey="Tian Y">Y Tian</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, X" uniqKey="Liu X">X Liu</name>
</author>
<author>
<name sortKey="Wang, W" uniqKey="Wang W">W Wang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Jung, K" uniqKey="Jung K">K Jung</name>
</author>
<author>
<name sortKey="Kim, K" uniqKey="Kim K">K Kim</name>
</author>
<author>
<name sortKey="Jain, A" uniqKey="Jain A">A Jain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liang, J" uniqKey="Liang J">J Liang</name>
</author>
<author>
<name sortKey="Doermann, D" uniqKey="Doermann D">D Doermann</name>
</author>
<author>
<name sortKey="Li, H" uniqKey="Li H">H Li</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhong, Y" uniqKey="Zhong Y">Y Zhong</name>
</author>
<author>
<name sortKey="Karu, K" uniqKey="Karu K">K Karu</name>
</author>
<author>
<name sortKey="Jain, Ak" uniqKey="Jain A">AK Jain</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Li, Hp" uniqKey="Li H">HP Li</name>
</author>
<author>
<name sortKey="Doermann, D" uniqKey="Doermann D">D Doermann</name>
</author>
<author>
<name sortKey="Kia, O" uniqKey="Kia O">O Kia</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhong, Y" uniqKey="Zhong Y">Y Zhong</name>
</author>
<author>
<name sortKey="Zhang, H" uniqKey="Zhang H">H Zhang</name>
</author>
<author>
<name sortKey="Jain, Ak" uniqKey="Jain A">AK Jain</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lienhart, R" uniqKey="Lienhart R">R Lienhart</name>
</author>
<author>
<name sortKey="Wernicke, A" uniqKey="Wernicke A">A Wernicke</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Lyu, Mr" uniqKey="Lyu M">MR Lyu</name>
</author>
<author>
<name sortKey="Song, J" uniqKey="Song J">J Song</name>
</author>
<author>
<name sortKey="Cai, M" uniqKey="Cai M">M Cai</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wright, J" uniqKey="Wright J">J Wright</name>
</author>
<author>
<name sortKey="Yang, A" uniqKey="Yang A">A Yang</name>
</author>
<author>
<name sortKey="Ganesh, A" uniqKey="Ganesh A">A Ganesh</name>
</author>
<author>
<name sortKey="Sastry, S" uniqKey="Sastry S">S Sastry</name>
</author>
<author>
<name sortKey="Ma, Y" uniqKey="Ma Y">Y Ma</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Elad, M" uniqKey="Elad M">M Elad</name>
</author>
<author>
<name sortKey="Aharon, M" uniqKey="Aharon M">M Aharon</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Liu, Y" uniqKey="Liu Y">Y Liu</name>
</author>
<author>
<name sortKey="Goto, S" uniqKey="Goto S">S Goto</name>
</author>
<author>
<name sortKey="Ikenaga, T" uniqKey="Ikenaga T">T Ikenaga</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Hua, Xs" uniqKey="Hua X">XS Hua</name>
</author>
<author>
<name sortKey="Liu, W" uniqKey="Liu W">W Liu</name>
</author>
<author>
<name sortKey="Zhang, Hj" uniqKey="Zhang H">HJ Zhang</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Canny, Jf" uniqKey="Canny J">JF Canny</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Breiman, L" uniqKey="Breiman L">L Breiman</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Freeman, H" uniqKey="Freeman H">H Freeman</name>
</author>
<author>
<name sortKey="Shapira, R" uniqKey="Shapira R">R Shapira</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Belongie, S" uniqKey="Belongie S">S Belongie</name>
</author>
<author>
<name sortKey="Malik, J" uniqKey="Malik J">J Malik</name>
</author>
<author>
<name sortKey="Puzicha, J" uniqKey="Puzicha J">J Puzicha</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Peng, Y" uniqKey="Peng Y">Y Peng</name>
</author>
<author>
<name sortKey="Ganesh, A" uniqKey="Ganesh A">A Ganesh</name>
</author>
<author>
<name sortKey="Wright, J" uniqKey="Wright J">J Wright</name>
</author>
<author>
<name sortKey="Xu, W" uniqKey="Xu W">W Xu</name>
</author>
<author>
<name sortKey="Ma, Y" uniqKey="Ma Y">Y Ma</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Everingham, M" uniqKey="Everingham M">M Everingham</name>
</author>
<author>
<name sortKey="Gool, Lv" uniqKey="Gool L">LV Gool</name>
</author>
<author>
<name sortKey="Williams, Cki" uniqKey="Williams C">CKI Williams</name>
</author>
<author>
<name sortKey="Winn, J" uniqKey="Winn J">J Winn</name>
</author>
<author>
<name sortKey="Zisserman, A" uniqKey="Zisserman A">A Zisserman</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Wolf, C" uniqKey="Wolf C">C Wolf</name>
</author>
<author>
<name sortKey="Jolion, Jm" uniqKey="Jolion J">JM Jolion</name>
</author>
</analytic>
</biblStruct>
<biblStruct>
<analytic>
<author>
<name sortKey="Zhang, Z" uniqKey="Zhang Z">Z Zhang</name>
</author>
<author>
<name sortKey="Ganesh, A" uniqKey="Ganesh A">A Ganesh</name>
</author>
<author>
<name sortKey="Liang, X" uniqKey="Liang X">X Liang</name>
</author>
<author>
<name sortKey="Ma, Y" uniqKey="Ma Y">Y Ma</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article">
<pmc-dir>properties open_access</pmc-dir>
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">PLoS One</journal-id>
<journal-id journal-id-type="iso-abbrev">PLoS ONE</journal-id>
<journal-id journal-id-type="publisher-id">plos</journal-id>
<journal-id journal-id-type="pmc">plosone</journal-id>
<journal-title-group>
<journal-title>PLoS ONE</journal-title>
</journal-title-group>
<issn pub-type="epub">1932-6203</issn>
<publisher>
<publisher-name>Public Library of Science</publisher-name>
<publisher-loc>San Francisco, USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmid">23940544</article-id>
<article-id pub-id-type="pmc">3734103</article-id>
<article-id pub-id-type="publisher-id">PONE-D-13-12568</article-id>
<article-id pub-id-type="doi">10.1371/journal.pone.0070173</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v2">
<subject>Computer Science</subject>
<subj-group>
<subject>Algorithms</subject>
</subj-group>
<subj-group>
<subject>Computing Methods</subject>
<subj-group>
<subject>Mathematical Computing</subject>
</subj-group>
</subj-group>
<subj-group>
<subject>Information Technology</subject>
</subj-group>
<subj-group>
<subject>Text Mining</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Rotation-Invariant Features for Multi-Oriented Text Detection in Natural Images</article-title>
<alt-title alt-title-type="running-head">Rotation-Invariant Features for Text Detection</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Yao</surname>
<given-names>Cong</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Zhang</surname>
<given-names>Xin</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Bai</surname>
<given-names>Xiang</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
<xref ref-type="corresp" rid="cor1">
<sup>*</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Liu</surname>
<given-names>Wenyu</given-names>
</name>
<xref ref-type="aff" rid="aff4">
<sup>4</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Ma</surname>
<given-names>Yi</given-names>
</name>
<xref ref-type="aff" rid="aff5">
<sup>5</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Tu</surname>
<given-names>Zhuowen</given-names>
</name>
<xref ref-type="aff" rid="aff5">
<sup>5</sup>
</xref>
<xref ref-type="aff" rid="aff6">
<sup>6</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<label>1</label>
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</aff>
<aff id="aff2">
<label>2</label>
<addr-line>Department of Computer Science and Technology, Tsinghua University, Beijing, China</addr-line>
</aff>
<aff id="aff3">
<label>3</label>
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</aff>
<aff id="aff4">
<label>4</label>
<addr-line>Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China</addr-line>
</aff>
<aff id="aff5">
<label>5</label>
<addr-line>Microsoft Research Asia, Beijing, China</addr-line>
</aff>
<aff id="aff6">
<label>6</label>
<addr-line>University of California Los Angeles, Los Angeles, California, United States of America</addr-line>
</aff>
<contrib-group>
<contrib contrib-type="editor">
<name>
<surname>Preis</surname>
<given-names>Tobias</given-names>
</name>
<role>Editor</role>
<xref ref-type="aff" rid="edit1"></xref>
</contrib>
</contrib-group>
<aff id="edit1">
<addr-line>University of Warwick, United Kingdom</addr-line>
</aff>
<author-notes>
<corresp id="cor1">* E-mail:
<email>xbai@hust.edu.cn</email>
</corresp>
<fn fn-type="conflict">
<p>
<bold>Competing Interests: </bold>
Two of the authors (YM and ZT) are employed by Microsoft Research Asia, Beijing, China. The authors released a dataset named MSRA-TD500 in this paper. This database is publicly available online and anyone can download it from
<ext-link ext-link-type="uri" xlink:href="http://www.loni.ucla.edu/~ztu/publication/MSRA-TD500.zip">http://www.loni.ucla.edu/~ztu/publication/MSRA-TD500.zip</ext-link>
. Moreover, to the best of the authors’ knowledge, there is currently no related patent or product. The authors confirm that this does not alter their adherence to all the PLOS ONE policies on sharing data and materials.</p>
</fn>
<fn fn-type="con">
<p>Conceived and designed the experiments: CY YM ZT. Performed the experiments: CY XZ XB. Analyzed the data: XB WL. Wrote the paper: CY YM ZT.</p>
</fn>
</author-notes>
<pub-date pub-type="collection">
<year>2013</year>
</pub-date>
<pub-date pub-type="epub">
<day>5</day>
<month>8</month>
<year>2013</year>
</pub-date>
<volume>8</volume>
<issue>8</issue>
<elocation-id>e70173</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>3</month>
<year>2013</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>6</month>
<year>2013</year>
</date>
</history>
<permissions>
<copyright-year>2013</copyright-year>
<copyright-holder>Yao et al</copyright-holder>
<license>
<license-p>This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
</license>
</permissions>
<abstract>
<p>Texts in natural scenes carry rich semantic information, which can be used to assist a wide range of applications, such as object recognition, image/video retrieval, mapping/navigation, and human computer interaction. However, most existing systems are designed to detect and recognize horizontal (or near-horizontal) texts. Due to the increasing popularity of mobile-computing devices and applications, detecting texts of varying orientations from natural images under less controlled conditions has become an important but challenging task. In this paper, we propose a new algorithm to detect texts of varying orientations. Our algorithm is based on a two-level classification scheme and two sets of features specially designed for capturing the intrinsic characteristics of texts. To better evaluate the proposed method and compare it with the competing algorithms, we generate a comprehensive dataset with various types of texts in diverse real-world scenes. We also propose a new evaluation protocol, which is more suitable for benchmarking algorithms for detecting texts in varying orientations. Experiments on benchmark datasets demonstrate that our system compares favorably with the state-of-the-art algorithms when handling horizontal texts and achieves significantly enhanced performance on variant texts in complex natural scenes.</p>
</abstract>
<funding-group>
<funding-statement>This work was supported by National Natural Science Foundation of China (grant No. 61173120, 60903096 and 61222308), and Ministry of Education of China project NCET-12-0217. ZT is supported by NSF CAREER award IIS-0844566 and NSF IIS-0844566. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</funding-statement>
</funding-group>
<counts>
<page-count count="15"></page-count>
</counts>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>Texts in a natural scene directly carry critical high-level semantic information. They are also ubiquitous in urban environments, e.g. on traffic signs, billboards, business name cards, and license plates. Effective text detection and recognition systems have been very useful in a variety of applications such as robot navigation
<xref ref-type="bibr" rid="pone.0070173-DeSouza1">[1]</xref>
, image search
<xref ref-type="bibr" rid="pone.0070173-Tsai1">[2]</xref>
, and human computer interaction
<xref ref-type="bibr" rid="pone.0070173-Kisacanin1">[3]</xref>
. The popularity of smart phones and ubiquitous computing devices has also made the acquisition and transmission of text data increasingly convenient and efficient. Thus, automatically detecting and recognizing texts from casually captured images has become an increasingly important task in computer vision.</p>
<p>In this paper, we tackle the problem of text detection in natural images, which remains a challenging task although it has been extensively studied in the past decades
<xref ref-type="bibr" rid="pone.0070173-Jain1">[4]</xref>
<xref ref-type="bibr" rid="pone.0070173-Minetto1">[19]</xref>
. The difficulty of automatic text detection mainly stems from two aspects: (1) the diversity of text appearances and (2) the complexity of cluttered backgrounds. On one hand, texts, unlike conventional objects (e.g. cars and horses), typically consist of a large number of different instances and exhibit significant variations in shape and appearance: different texts may have different sizes, colors, fonts, languages, and orientations, even within the same scene. On the other hand, many other man-made objects (such as windows and railings) in the scene often bear a great deal of similarity to texts. Sometimes even natural objects (such as grasses and leaves) happen to be arranged in a way that resembles a sequence of characters. Such ambiguities have made reliable text detection in natural images a challenging task.</p>
<p>In the literature, most of the existing methods
<xref ref-type="bibr" rid="pone.0070173-Kim1">[6]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Wang2">[20]</xref>
have focused on detecting horizontal or near-horizontal texts, as we will see in a survey of related work. Obviously, the requirement of being horizontal severely limits the applicability of those methods in scenarios where images are taken casually with a mobile device. Detecting texts with varying orientations in complex natural scenes remains a challenge for most practical text detection and recognition systems
<xref ref-type="bibr" rid="pone.0070173-ABBYY1">[21]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-GVision1">[22]</xref>
. In this work, we aim to build an effective and efficient system for detecting multi-oriented texts in complex natural scenes (see
<xref ref-type="fig" rid="pone-0070173-g001">Fig. 1</xref>
).</p>
<fig id="pone-0070173-g001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g001</object-id>
<label>Figure 1</label>
<caption>
<title>Detected texts in natural images by the proposed algorithm.</title>
</caption>
<graphic xlink:href="pone.0070173.g001"></graphic>
</fig>
<p>Most conventional text detection methods rely on features that are primarily designed for horizontal texts (such as those used in
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Pan1">[13]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Minetto1">[19]</xref>
). Thus, when such methods are applied to images that contain multi-oriented texts, their performance usually drops drastically. To remedy this situation, we introduce two additional sets of rotation-invariant features for text detection. To further reduce the false positives produced by using only such low-level features, we have also designed a two-level classification scheme that can effectively discriminate texts from non-texts. Hence, by combining the strength of rotation-invariant features and well-trained text classifiers, our system is able to effectively detect multi-oriented texts with very few false positives.</p>
<p>The proposed method is mostly bottom-up (data-driven), with additional prior knowledge about texts imposed in a top-down fashion. Pixels are first grouped into connected components, corresponding to strokes or characters; connected components are then linked together to form chains, corresponding to words or sentences. The connected components and chains are verified by the orientation-invariant features and discriminative classifiers. With this strategy, our method combines the strengths of prior knowledge about texts (such as uniform stroke width) and classifiers automatically learned from labeled training data. In this way, we can strike a good balance between systematic design and machine learning, which is shown to be advantageous over either heavy black-box learning
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
or purely heuristic design
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
.</p>
<p>To evaluate the effectiveness of our system, we have conducted extensive experiments on both conventional benchmarks and some new (more extensive and challenging) datasets. Compared with state-of-the-art text detection algorithms, our system performs competitively in the conventional setting of horizontal texts. We have also tested our system on a challenging dataset of 500 natural images containing texts of various orientations in complex backgrounds. On this dataset, our system performs significantly better than existing systems, with an F-measure of about 0.6, more than twice that of the closest competitor.</p>
<p>We have presented a preliminary version of our work in
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
. This paper extends that article with the following contributions: (1) some steps of the algorithm are improved; specifically, the case of detecting single characters, which is largely neglected by existing methods, is discussed; (2) further evaluations, including text detection experiments on the dataset of the latest ICDAR robust reading competition (ICDAR 2011) and on texts of different languages, are conducted; (3) an end-to-end multi-oriented scene text recognition system, integrating the proposed text detection algorithm with an off-the-shelf OCR engine, is introduced; (4) the proposed evaluation protocol is detailed; (5) more technical details of the proposed method are presented; and (6) comprehensive discussions and analyses are given.</p>
<sec id="s1a">
<title>Related Work</title>
<p>There have been a large number of systems dealing with text detection in natural images and videos
<xref ref-type="bibr" rid="pone.0070173-Jain1">[4]</xref>
<xref ref-type="bibr" rid="pone.0070173-Neumann2">[18]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Gllavata1">[24]</xref>
<xref ref-type="bibr" rid="pone.0070173-Liu1">[28]</xref>
. Comprehensive surveys can be found in
<xref ref-type="bibr" rid="pone.0070173-Jung1">[29]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Liang1">[30]</xref>
. Existing approaches to text detection can be roughly divided into three categories: texture-based, component-based, and hybrid methods.</p>
<sec id="s1a1">
<title>Three categories of existing approaches</title>
<p>
<bold>Texture-based methods</bold>
(e.g.
<xref ref-type="bibr" rid="pone.0070173-Kim1">[6]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Gllavata1">[24]</xref>
) treat text as a special type of texture and make use of its textural properties, such as local intensities, spatial variance, filter responses, and wavelet coefficients. Generally, these methods are computationally demanding, as all locations and scales are exhaustively scanned. Moreover, these algorithms mostly detect only horizontal texts.</p>
<p>In an early work, Zhong
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Zhong1">[31]</xref>
proposed a method for text localization in color images. Horizontal spatial variance was used to roughly localize texts and color segmentation was performed within the localized areas to extract text components. The system of Wu
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Wu1">[32]</xref>
adopted a set of Gaussian derivatives to segment texts. Rectangular boxes surrounding the corresponding text strings were formed, based on certain heuristic rules on text strings, such as height similarity, spacing and alignment. The above steps were applied to an image pyramid and the results were fused to make final detections. Li
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Li1">[33]</xref>
presented a system for detecting and tracking texts in digital video. In this system, the mean and the second- and third-order central moments of wavelet decomposition responses are used as local features. Zhong
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Zhong2">[34]</xref>
proposed to localize candidate caption text regions directly in the discrete cosine transform (DCT) compressed domain using the intensity variation information encoded in the DCT domain. The method proposed by Gllavata
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Gllavata1">[24]</xref>
utilized the distribution of high-frequency wavelet coefficients to statistically characterize text and non-text areas.</p>
<p>Different from the methods surveyed above, in which filter responses or transform domain coefficients are used as features, the algorithm of Kim
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Kim1">[6]</xref>
relies merely on intensities of raw pixels. A Support Vector Machine (SVM) classifier is trained to generate probability maps, in which the positions and extents of texts are searched using adaptive mean shift. Lienhart and Wernicke
<xref ref-type="bibr" rid="pone.0070173-Lienhart1">[35]</xref>
used complex-valued edge orientation maps computed from the original RGB image as features and trained a neural network to distinguish between text and non-text patterns.</p>
<p>The method of Weinman
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Weinman1">[36]</xref>
used a rich representation that captures important relationships between responses to different scale- and orientation-selective filters. To improve performance, a conditional random field (CRF) was used to exploit the dependencies between neighboring image region labels. Based on the observation that areas with high edge density indicate text regions, text detection in
<xref ref-type="bibr" rid="pone.0070173-Lyu1">[37]</xref>
was carried out in a sequential multi-resolution paradigm.</p>
<p>To speed up text detection, Chen
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
proposed an efficient text detector, which is a cascade Adaboost classifier. The weak classifiers are trained on a set of informative features, including mean and variance of intensity, horizontal and vertical derivatives, and histograms of intensity gradient. Recently, Wang
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Wang1">[10]</xref>
presented a method for spotting words in natural images. They first perform character detection for every letter in an alphabet and then evaluate the configuration scores for the words in a specified list to pick out the most probable one.</p>
<p>
<bold>Component-based methods</bold>
(e.g.
<xref ref-type="bibr" rid="pone.0070173-Jain1">[4]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Ikica1">[38]</xref>
) first extract candidate text components through various ways (e.g. color reduction
<xref ref-type="bibr" rid="pone.0070173-Jain1">[4]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
and Maximally Stable Extremal Region detection
<xref ref-type="bibr" rid="pone.0070173-Neumann1">[11]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Chen3">[25]</xref>
) and then eliminate non-text components using heuristic rules or a trained classifier, based on geometric and appearance properties. Component-based methods are usually more efficient than texture-based methods because the number of candidate components is relatively small. These methods are also more robust to variations of texts, such as changes of font, scale and orientation. Moreover, the detected text components can be directly used for character recognition. Due to these advantages, recent progress in text detection and recognition in natural images has been largely driven by this category of methods
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Neumann1">[11]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Neumann2">[18]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Ikica1">[38]</xref>
<xref ref-type="bibr" rid="pone.0070173-Neumann3">[40]</xref>
.</p>
<p>In
<xref ref-type="bibr" rid="pone.0070173-Jain1">[4]</xref>
, color reduction and multi-valued image decomposition are performed to partition the input image into multiple foreground components. Connected component analysis is applied to these foreground components, followed by a text identification module, to filter out non-text components.</p>
<p>The great success of sparse representation in face recognition
<xref ref-type="bibr" rid="pone.0070173-Wright1">[41]</xref>
and image denoising
<xref ref-type="bibr" rid="pone.0070173-Elad1">[42]</xref>
has inspired numerous researchers in the community. The authors of
<xref ref-type="bibr" rid="pone.0070173-Pan2">[43]</xref>
and
<xref ref-type="bibr" rid="pone.0070173-Zhao1">[12]</xref>
apply a classification procedure to candidate text components using learned discriminative dictionaries.</p>
<p>The MSER-based methods
<xref ref-type="bibr" rid="pone.0070173-Neumann1">[11]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Neumann2">[18]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Chen3">[25]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Neumann3">[40]</xref>
have attracted much attention from the community, because of the excellent characteristics of MSERs (Maximally Stable Extremal Regions)
<xref ref-type="bibr" rid="pone.0070173-Matas1">[44]</xref>
. MSERs can be computed efficiently (near linear complexity) and are robust to noise and affine transformation. In
<xref ref-type="bibr" rid="pone.0070173-Neumann1">[11]</xref>
, MSERs are detected and taken as candidate text components. Neumann
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Neumann3">[40]</xref>
modified the original MSER algorithm to take region topology into consideration, leading to superior detection performance. Chen
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Chen3">[25]</xref>
also proposed an extension to MSER, in which the boundaries of MSERs are enhanced via edge detection, to cope with image blur. Recently, Neumann
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Neumann2">[18]</xref>
further extend the work of
<xref ref-type="bibr" rid="pone.0070173-Neumann1">[11]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Neumann3">[40]</xref>
to achieve real-time text detection and recognition.</p>
<p>Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
proposed a novel image operator, called Stroke Width Transform (SWT), which transforms the image data from containing color values per pixel to containing the most likely stroke width. Based on SWT and a set of heuristic rules, this algorithm can reliably detect horizontal texts.</p>
<p>While most existing algorithms are designed for horizontal or near-horizontal texts, Yi
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
and Shivakumara
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shivakumara1">[16]</xref>
consider the problem of detecting multi-oriented texts in images or video frames. After extracting candidate components using gradient and color based partition, the line grouping strategy in
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
aggregates the components into text strings. The text strings can be in any direction. However, the method of
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
relies on a large set of manually defined rules and thresholds. In
<xref ref-type="bibr" rid="pone.0070173-Shivakumara1">[16]</xref>
, candidate text component clusters are identified by
<italic>K</italic>
-means clustering in the Fourier-Laplacian domain. The component clusters are divided into separate components using skeletonization. Even though this method can handle multi-oriented texts, it only detects text blocks, rather than characters, words or sentences.</p>
<p>Finally,
<bold>hybrid methods</bold>
(e.g.
<xref ref-type="bibr" rid="pone.0070173-Pan1">[13]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Liu2">[45]</xref>
) are a mixture of texture-based and component-based methods. In
<xref ref-type="bibr" rid="pone.0070173-Liu2">[45]</xref>
, edge pixels of all possible text regions are extracted, using an elaborate edge detection method; the gradient and geometrical properties of region contours are verified to generate candidate text regions, followed by a texture analysis procedure to distinguish true text regions from non-text regions. Unlike
<xref ref-type="bibr" rid="pone.0070173-Liu2">[45]</xref>
, the hybrid method proposed by Pan
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Pan1">[13]</xref>
extracts candidate components from probability maps at multiple scales. The probability maps are estimated by a classifier, which is trained using a set of texture features (HOG features
<xref ref-type="bibr" rid="pone.0070173-Dalal1">[46]</xref>
) computed in predefined patterns. Like most other algorithms, these two methods only detect horizontal texts.</p>
</sec>
</sec>
<sec id="s1b">
<title>Our Strategy</title>
<p>We have drawn two observations about the current text detection algorithms: (1) methods that are purely based on learning (nearly black-box)
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
by training classifiers on a large amount of data can reach a certain but limited level of success (the system of
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
obtained from the authors produces reasonable results on horizontal English texts but has poor performance in general cases); (2) systems that are based on smart features, such as Stroke Width Transform (SWT)
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
, are robust to variations of texts but involve a great deal of tuning and are still far from producing fully satisfactory results, especially for non-horizontal texts.</p>
<p>In this paper, we adopt SWT and also design various new features that are intrinsic to texts and robust to variations (such as rotation and scale changes); a two-level classification scheme is devised to make moderate use of training and avoid sensitive hand-tuned parameters. We observe significant improvement over existing approaches in dealing with real-world scenes.</p>
<p>Though widely used in the community, the ICDAR datasets
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
only contain horizontal English texts. In
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
, a dataset with texts of different directions is released, but it includes only 89 images without enough diversity in the texts and backgrounds. Here we collect a new dataset with 500 images of indoor and outdoor scenes. In addition, the evaluation methods used in
<xref ref-type="bibr" rid="pone.0070173-Hua1">[50]</xref>
and the ICDAR competitions
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
are mainly designed for horizontal texts. Hence, we propose a new protocol that is more suitable for assessing algorithms developed for multi-oriented texts.</p>
</sec>
<sec id="s1c">
<title>Proposed Approach</title>
<p>The proposed algorithm consists of four stages: (1) component extraction, (2) component analysis, (3) candidate linking, and (4) chain analysis, which can be further categorized into two procedures, bottom-up grouping and top-down pruning, as shown in
<xref ref-type="fig" rid="pone-0070173-g002">Fig. 2</xref>
. In the bottom-up grouping procedure, pixels are first grouped into connected components, and these connected components are later aggregated to form chains; in the top-down pruning procedure, non-text components and chains are successively identified and eliminated. The two procedures are applied alternately.</p>
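As a rough illustration of this alternation between grouping and pruning, the following Python skeleton sketches the four stages; it is not the authors' code, and the stage implementations are passed in as hypothetical callables.

```python
from typing import Any, Callable, List

def detect_text(image: Any,
                extract_components: Callable[[Any], List[Any]],
                is_text_component: Callable[[Any], bool],
                link_candidates: Callable[[List[Any]], List[Any]],
                is_text_chain: Callable[[Any], bool]) -> List[Any]:
    """Illustrative four-stage pipeline; the concrete stage implementations
    are supplied by the caller (placeholders, not the published system)."""
    # Bottom-up grouping, step 1: pixels -> connected components (e.g. via SWT)
    components = extract_components(image)
    # Top-down pruning, step 1: heuristic rules + component-level classifier
    components = [c for c in components if is_text_component(c)]
    # Bottom-up grouping, step 2: components -> candidate pairs -> chains
    chains = link_candidates(components)
    # Top-down pruning, step 2: chain-level classifier drops low-scoring chains
    return [ch for ch in chains if is_text_chain(ch)]
```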
<fig id="pone-0070173-g002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g002</object-id>
<label>Figure 2</label>
<caption>
<title>Pipeline of the proposed approach.</title>
</caption>
<graphic xlink:href="pone.0070173.g002"></graphic>
</fig>
<sec id="s1c1">
<title>Component extraction</title>
<p>At this stage, edge detection is performed on the original image and the edge map is fed to the SWT
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
module to produce an SWT image. Neighboring pixels in the SWT image are grouped together recursively to form connected components using a simple association rule.</p>
</sec>
<sec id="s1c2">
<title>Component analysis</title>
<p>Many components extracted at the component extraction stage are not parts of texts. The component analysis stage aims to identify and filter out these non-text components. First, the components are filtered using a set of heuristic rules that distinguish obviously spurious text regions from true text regions. Next, a component-level classifier is applied to prune the non-text components that the simple filter cannot remove.</p>
</sec>
<sec id="s1c3">
<title>Candidate linking</title>
<p>The remaining components are taken as character candidates. In fact, components do not necessarily correspond to characters, because a single character in some languages may consist of several strokes; however, we still call them characters (or character candidates) hereafter for simplicity. The first step of the candidate linking stage is to link the character candidates into pairs. Two adjacent candidates are grouped into a pair if they have similar geometric properties and colors. At the next step, the candidate pairs are aggregated into chains in a recursive manner.</p>
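A minimal sketch of this linking step follows. The pairing test (comparable stroke widths, heights and mean colors) and its thresholds are illustrative assumptions; the actual method also requires the two candidates to be adjacent, which is omitted here for brevity.

```python
import numpy as np

def can_pair(a: dict, b: dict, max_ratio: float = 2.0, max_color_dist: float = 40.0) -> bool:
    """Hypothetical pairing test: similar stroke width, similar height and close
    mean color. Thresholds are illustrative, not taken from the paper."""
    ratio = lambda x, y: max(x, y) / max(min(x, y), 1e-6)
    color_dist = np.linalg.norm(np.asarray(a["color"], float) - np.asarray(b["color"], float))
    return (ratio(a["stroke_width"], b["stroke_width"]) < max_ratio
            and ratio(a["height"], b["height"]) < max_ratio
            and color_dist < max_color_dist)

def link_into_chains(candidates: list) -> list:
    """Aggregate linked pairs into chains using a small union-find."""
    parent = list(range(len(candidates)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if can_pair(candidates[i], candidates[j]):
                parent[find(i)] = find(j)
    groups: dict = {}
    for i in range(len(candidates)):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) >= 2]   # chains of at least two candidates
```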
</sec>
<sec id="s1c4">
<title>Chain analysis</title>
<p>At the chain analysis stage, the chains determined at the previous stage are verified by a chain-level classifier. Chains with low classification scores (probabilities) are discarded. The chains may be in any direction, so a candidate might belong to multiple chains; the interpretation step aims to resolve this ambiguity. The chains that pass this stage are the final detected texts.</p>
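One plausible reading of this verification-plus-interpretation step, assuming each chain comes with a classifier probability, is sketched below; the greedy "highest probability wins" rule for candidates shared by several chains is an assumption, not the paper's stated procedure.

```python
from typing import List, Sequence, Set

def prune_chains(chains: List[Set[int]], probs: Sequence[float],
                 min_prob: float = 0.5) -> List[Set[int]]:
    """Discard low-scoring chains, then resolve candidates that belong to
    several chains by accepting chains in decreasing probability order
    (assumed disambiguation rule)."""
    order = sorted((i for i, p in enumerate(probs) if p >= min_prob),
                   key=lambda i: probs[i], reverse=True)
    used: Set[int] = set()
    kept: List[Set[int]] = []
    for i in order:
        if chains[i] & used:      # shares a candidate with an already accepted chain
            continue
        used |= chains[i]
        kept.append(chains[i])
    return kept
```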
<p>The remainder of this paper is organized as follows. Section
<bold>Methodology</bold>
presents the details of the proposed method, including the algorithm pipeline and the two sets of features. Section
<bold>Dataset and Evaluation Protocol</bold>
introduces the proposed dataset and evaluation protocol. The experimental results and discussions are given in Section
<bold>Experiments and Discussions</bold>
. Section
<bold>Conclusions</bold>
concludes the paper and points out potential directions for future research.</p>
</sec>
</sec>
</sec>
<sec sec-type="methods" id="s2">
<title>Methodology</title>
<p>In this section, we present the details of the proposed algorithm. Specifically, the pipeline of the algorithm will be presented in Section
<bold>Algorithm Pipeline</bold>
and the details of the features will be described in Section
<bold>Feature Design</bold>
.</p>
<sec id="s2a">
<title>Algorithm Pipeline</title>
<sec id="s2a1">
<title>Component extraction</title>
<p>To extract connected components from the image, SWT
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
is adopted for its effectiveness and efficiency. SWT is an image operator that computes, for each pixel, the width of the most likely stroke containing that pixel. It provides a way to discover connected components directly from the edge map, which makes it unnecessary to consider scale and direction. See
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
for details.</p>
<p>SWT runs on an edge map, so we use the Canny edge detector
<xref ref-type="bibr" rid="pone.0070173-Canny1">[51]</xref>
to produce an edge map (
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(b) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
) from the original image (
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(a) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
). The resulting SWT image is shown in
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(c) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
.</p>
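The sketch below shows one simplified way to realize this step with OpenCV: Canny edge detection followed by ray casting along the gradient direction. The thresholds are illustrative, and the published SWT additionally handles both text polarities and refines ray values, which is omitted here.

```python
import cv2
import numpy as np

def stroke_width_transform(gray, lo=100, hi=300, max_len=100):
    """Simplified SWT sketch on a grayscale uint8 image: from each edge pixel,
    walk along the gradient direction until another edge pixel with a roughly
    opposite gradient is hit; the walked distance becomes a stroke-width
    estimate for every pixel on the ray."""
    edges = cv2.Canny(gray, lo, hi)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy) + 1e-6
    dx, dy = gx / mag, gy / mag
    h, w = gray.shape
    swt = np.full((h, w), np.inf, dtype=np.float32)
    for y, x in zip(*np.nonzero(edges)):
        ray, cx, cy = [(x, y)], float(x), float(y)
        for _ in range(max_len):
            cx, cy = cx + dx[y, x], cy + dy[y, x]
            ix, iy = int(round(cx)), int(round(cy))
            if not (0 <= ix < w and 0 <= iy < h):
                break
            ray.append((ix, iy))
            if edges[iy, ix]:
                # accept the ray only if the opposite gradient is roughly anti-parallel
                if dx[y, x] * dx[iy, ix] + dy[y, x] * dy[iy, ix] < -0.7:
                    width = np.hypot(ix - x, iy - y)
                    for px, py in ray:
                        swt[py, px] = min(swt[py, px], width)
                break
    return swt
```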
<fig id="pone-0070173-g003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g003</object-id>
<label>Figure 3</label>
<caption>
<title>Typical images from the proposed dataset along with ground truth rectangles.</title>
<p>Notice the red rectangles. They indicate the texts within them are labeled as difficult (due to blur or occlusion).</p>
</caption>
<graphic xlink:href="pone.0070173.g003"></graphic>
</fig>
<p>The next step of this stage is to group the pixels in the SWT image into connected components. The pixels are associated using a simple rule: two neighboring pixels are grouped together if the ratio of their SWT values is less than 3.0. The connected components are shown in
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(d) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
. Note the red rectangles in the image, where each rectangle contains a connected component.</p>
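Continuing the sketch above, a direct flood-fill implementation of this association rule might look as follows; 4-connectivity and the treatment of non-stroke pixels as infinite SWT values are assumptions of the sketch.

```python
import numpy as np
from collections import deque

def group_swt_pixels(swt, max_ratio=3.0):
    """Group neighboring SWT pixels into connected components: two 4-connected
    pixels are merged when the ratio of their SWT values is below max_ratio
    (3.0, as stated in the text). Returns an integer label map (0 = background)."""
    h, w = swt.shape
    labels = np.zeros((h, w), dtype=np.int32)
    current = 0
    for sy, sx in zip(*np.nonzero(np.isfinite(swt))):
        if labels[sy, sx]:
            continue
        current += 1
        labels[sy, sx] = current
        queue = deque([(sy, sx)])
        while queue:                            # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and not labels[ny, nx] \
                        and np.isfinite(swt[ny, nx]):
                    ratio = max(swt[y, x], swt[ny, nx]) / min(swt[y, x], swt[ny, nx])
                    if ratio < max_ratio:
                        labels[ny, nx] = current
                        queue.append((ny, nx))
    return labels
```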
<p>In fact, the proposed pipeline is general and not specific to any kind of low level operator for component extraction. Though SWT is employed to extract components in this paper, other methods (such as MSER
<xref ref-type="bibr" rid="pone.0070173-Neumann1">[11]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Chen3">[25]</xref>
) that are able to reliably generate connected components corresponding to character candidates can also be used. We leave evaluation and comparison of different component extraction methods for future research.</p>
</sec>
<sec id="s2a2">
<title>Component analysis</title>
<p>The purpose of component analysis is to identify and eliminate the connected components that are unlikely to be parts of texts. To this end, we devise a two-layer filtering mechanism.</p>
<p>The first layer is a filter consisting of a set of heuristic rules. This filter operates on a collection of statistical and geometric properties of components, which are very fast to compute. True text components usually have nearly constant stroke width and a compact structure (not too thin and long), so width variation, aspect ratio and occupation ratio are chosen as the basic properties to filter out obvious non-text components.</p>
<p>For a connected component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e001.jpg"></inline-graphic>
</inline-formula>
with
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e002.jpg"></inline-graphic>
</inline-formula>
foreground pixels (black pixels in the SWT image), we first compute its bounding box
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e003.jpg"></inline-graphic>
</inline-formula>
(its width and height are denoted by
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e004.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e005.jpg"></inline-graphic>
</inline-formula>
, respectively) and the mean as well as standard deviation of the stroke widths,
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e006.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e007.jpg"></inline-graphic>
</inline-formula>
. The basic properties are defined as follows:</p>
<list list-type="bullet">
<list-item>
<p>
<bold>Width variation.</bold>
Width variation measures the variation in stroke width of the component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e008.jpg"></inline-graphic>
</inline-formula>
:
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e009.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Aspect ratio.</bold>
For horizontal text, aspect ratio is defined as the ratio between the width and height of the component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e010.jpg"></inline-graphic>
</inline-formula>
. To accommodate texts of different directions, we use a new definition of aspect ratio:
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e011.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Occupation ratio.</bold>
Occupation ratio is used to remove non-text components caused by spurious rays in the SWT image. This property is defined as the ratio between the number of foreground pixels and area of the component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e012.jpg"></inline-graphic>
</inline-formula>
:
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e013.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
</list>
<p>The valid ranges of these basic properties are empirically set to [0,1], [0.1,1] and [0.1,1], respectively. Components with one or more invalid properties will be taken as non-text regions and discarded. A large portion of obvious non-text components are eliminated after this step (notice the difference between
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(d) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
and
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(e) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
), suggesting that this preliminary filter is effective.</p>
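<p>The sketch below illustrates the first-layer filter. The exact property definitions appear above only as images, so the formulas used here are assumptions: width variation as the standard deviation of stroke widths normalized by their mean, aspect ratio as the min/max ratio of the bounding-box sides, and occupation ratio as the fraction of the bounding box covered by foreground pixels.</p>
```python
# Hedged sketch of the first-layer heuristic filter (assumed property formulas).
import numpy as np

def basic_properties(swt, component_mask):
    widths = swt[component_mask]                     # stroke widths of foreground pixels
    ys, xs = np.nonzero(component_mask)
    w = xs.max() - xs.min() + 1                      # bounding-box width
    h = ys.max() - ys.min() + 1                      # bounding-box height
    width_variation = widths.std() / (widths.mean() + 1e-9)
    aspect_ratio = min(w, h) / max(w, h)             # rotation-tolerant definition
    occupation_ratio = component_mask.sum() / float(w * h)
    return width_variation, aspect_ratio, occupation_ratio

def passes_heuristic_filter(props):
    wv, ar, occ = props
    # valid ranges from the text: [0,1], [0.1,1] and [0.1,1]
    return (0.0 <= wv <= 1.0) and (0.1 <= ar <= 1.0) and (0.1 <= occ <= 1.0)
```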
<p>The second layer is a classifier trained to identify and reject the non-text components that are hard to remove with the preliminary filter. A collection of component level features, which capture the differences of geometric and textural properties between text components and non-text components, are used to train this classifier. The criteria for feature design are: scale invariance, rotation invariance and low computational cost. To meet these criteria, we propose to estimate the center, characteristic scale and major orientation of each component (
<xref ref-type="fig" rid="pone-0070173-g004">Fig. 4</xref>
of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
) before computing the component level features. Based on these characteristics, features that are both effective and computationally efficient can be obtained. The details of these component level features are discussed in Section
<bold>Component Level Features</bold>
.</p>
<fig id="pone-0070173-g004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g004</object-id>
<label>Figure 4</label>
<caption>
<title>Ground truth generation.</title>
<p>(a) Human annotation. The annotators are required to locate and bound each text line using a four-vertex polygon (red dots and yellow lines). (b) Ground truth rectangle (green). The ground truth rectangle is generated automatically by fitting a minimum area rectangle using the polygon.</p>
</caption>
<graphic xlink:href="pone.0070173.g004"></graphic>
</fig>
<p>For a component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e014.jpg"></inline-graphic>
</inline-formula>
, the barycenter
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e015.jpg"></inline-graphic>
</inline-formula>
, major axis
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e016.jpg"></inline-graphic>
</inline-formula>
, minor axis
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e017.jpg"></inline-graphic>
</inline-formula>
, and orientation
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e018.jpg"></inline-graphic>
</inline-formula>
are estimated using Camshift
<xref ref-type="bibr" rid="pone.0070173-Bradski1">[52]</xref>
by taking the SWT image of component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e019.jpg"></inline-graphic>
</inline-formula>
as distribution map. The center, characteristic scale and major orientation of component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e020.jpg"></inline-graphic>
</inline-formula>
are defined as:
<disp-formula id="pone.0070173.e021">
<graphic xlink:href="pone.0070173.e021"></graphic>
<label>(1)</label>
</disp-formula>
<disp-formula id="pone.0070173.e022">
<graphic xlink:href="pone.0070173.e022"></graphic>
<label>(2)</label>
</disp-formula>
<disp-formula id="pone.0070173.e023">
<graphic xlink:href="pone.0070173.e023"></graphic>
<label>(3)</label>
</disp-formula>
</p>
<p>These characteristics are invariant to translation, scale and rotation to some degree (
<xref ref-type="fig" rid="pone-0070173-g004">Fig. 4</xref>
of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
). As we will explain in Section
<bold>Component Level Features</bold>
, this is the key to the scale and rotation invariance of the component level features.</p>
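<p>A rough sketch of this estimation step follows. The paper uses Camshift on the SWT image; here, weighted image moments are used instead to obtain a comparable ellipse fit (barycenter, major and minor axes, orientation), and the characteristic scale is assumed to be the mean of the two axes. This is an approximation for illustration, not the exact procedure of Eqns. 1-3.</p>
```python
# Sketch of estimating a component's center, characteristic scale and major
# orientation from SWT-weighted image moments (approximation, not Camshift).
import numpy as np

def component_geometry(swt, component_mask):
    ys, xs = np.nonzero(component_mask)
    vals = swt[ys, xs]                               # SWT values act as weights
    m = vals.sum()
    cx, cy = (xs * vals).sum() / m, (ys * vals).sum() / m   # weighted barycenter
    mu20 = ((xs - cx) ** 2 * vals).sum() / m
    mu02 = ((ys - cy) ** 2 * vals).sum() / m
    mu11 = ((xs - cx) * (ys - cy) * vals).sum() / m
    # eigen-decomposition of the 2x2 covariance gives the ellipse axes
    common = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    major = np.sqrt(2 * (mu20 + mu02 + common))
    minor = np.sqrt(max(2 * (mu20 + mu02 - common), 1e-9))
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)  # major orientation
    scale = 0.5 * (major + minor)                    # assumed characteristic scale
    return (cx, cy), scale, theta, major, minor
```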
<p>We train a component level classifier using the component level features. Random Forest
<xref ref-type="bibr" rid="pone.0070173-Breiman1">[53]</xref>
is chosen as the strong classifier. The component level classifier is the first level of the two-level classification scheme. The probability of component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e024.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e025.jpg"></inline-graphic>
</inline-formula>
, is the fraction of votes for the positive class (text) from the trees. The components whose probabilities are lower than a threshold
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e026.jpg"></inline-graphic>
</inline-formula>
are eliminated and the remaining components are considered as character candidates (
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(f) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
). To ensure high recall, the threshold
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e027.jpg"></inline-graphic>
</inline-formula>
is set very low, as a high threshold may filter out true text components.</p>
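<p>A minimal sketch of this first classification level is shown below, using a Random Forest with 200 trees as stated in the experiments. The feature matrix is assumed to contain the component level features described later; the particular threshold value used here is only illustrative, since the actual value appears as an inline image.</p>
```python
# Sketch of the component level (first-level) classifier.
from sklearn.ensemble import RandomForestClassifier

def train_component_classifier(X_train, y_train):
    clf = RandomForestClassifier(n_estimators=200)   # 200 trees, as in the paper
    clf.fit(X_train, y_train)
    return clf

def filter_components(clf, features, components, t1=0.1):
    # averaged tree predictions, used here as the probability of "text"
    probs = clf.predict_proba(features)[:, 1]
    keep = probs >= t1                               # low threshold to favor recall
    kept = [c for c, k in zip(components, keep) if k]
    return kept, probs[keep]
```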
</sec>
<sec id="s2a3">
<title>Candidate linking</title>
<p>The character candidates are aggregated into chains at this stage. This stage also serves as a filtering step because the candidate characters cannot be linked into chains are taken as components accidentally formed by noises or background clutters, and thus are discarded.</p>
<p>Firstly, character candidates are linked into pairs. In
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
, whether two candidates can be linked into a pair is determined based on the heights and widths of their bounding boxes. However, bounding boxes are not rotation invariant, so we use their characteristic scales instead. If two candidates have similar stroke widths (ratio between the mean stroke widths is less than 2), similar sizes (ratio between their characteristic scales does not exceed 2.5), similar colors and are close enough (distance between them is less than two times the sum of their characteristic scales), they are labeled as a pair. The above parameters are optimized using the training data of the ICDAR datasets
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
; this parameter setting nevertheless turns out to be effective for all the datasets used in this paper.</p>
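<p>The pairing rule can be sketched as follows. The stroke-width, scale and distance thresholds are those stated in the text; the candidate attributes (mean stroke width, characteristic scale, center, color histogram) and the color-distance cutoff are hypothetical names and values introduced for illustration.</p>
```python
# Sketch of the pairing rule; candidate attribute names are hypothetical.
import numpy as np
from itertools import combinations

def can_pair(a, b, color_thresh=0.5):
    sw_ratio = max(a.mean_sw, b.mean_sw) / min(a.mean_sw, b.mean_sw)
    scale_ratio = max(a.scale, b.scale) / min(a.scale, b.scale)
    dist = np.hypot(a.cx - b.cx, a.cy - b.cy)
    color_dist = np.linalg.norm(a.color_hist - b.color_hist)  # assumed color test
    return (sw_ratio < 2.0                       # similar stroke widths
            and scale_ratio <= 2.5               # similar sizes
            and dist < 2.0 * (a.scale + b.scale) # close enough
            and color_dist < color_thresh)       # similar colors

def link_pairs(candidates):
    return [(a, b) for a, b in combinations(candidates, 2) if can_pair(a, b)]
```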
<p>Unlike
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
and
<xref ref-type="bibr" rid="pone.0070173-Neumann1">[11]</xref>
, which only consider horizontal or near-horizontal linkings, the proposed algorithm allows linkings of arbitrary directions. This endows the system with the ability to detect multi-oriented texts, rather than being limited to horizontal texts.</p>
<p>Next, a greedy hierarchical agglomerative clustering
<xref ref-type="bibr" rid="pone.0070173-Hastie1">[54]</xref>
method is applied to aggregate the pairs into candidate chains. Initially, each pair constitutes a chain. Then the similarity between every two chains that share at least one common candidate and have similar orientations is computed; the two chains with the highest similarity are merged to form a new chain. The orientation consistency
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e028.jpg"></inline-graphic>
</inline-formula>
and population consistency
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e029.jpg"></inline-graphic>
</inline-formula>
between two chains
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e030.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e031.jpg"></inline-graphic>
</inline-formula>
, which share at least one common candidate, are defined as:
<disp-formula id="pone.0070173.e032">
<graphic xlink:href="pone.0070173.e032"></graphic>
<label>(4)</label>
</disp-formula>
and
<disp-formula id="pone.0070173.e033">
<graphic xlink:href="pone.0070173.e033"></graphic>
<label>(5)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e034.jpg"></inline-graphic>
</inline-formula>
is the included angle of
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e035.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e036.jpg"></inline-graphic>
</inline-formula>
while
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e037.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e038.jpg"></inline-graphic>
</inline-formula>
are the candidate numbers of
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e039.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e040.jpg"></inline-graphic>
</inline-formula>
.
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e041.jpg"></inline-graphic>
</inline-formula>
is used to judge whether two chains have similar orientations and is empirically set to
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e042.jpg"></inline-graphic>
</inline-formula>
. The similarity between two chains
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e043.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e044.jpg"></inline-graphic>
</inline-formula>
is defined as the harmonic mean
<xref ref-type="bibr" rid="pone.0070173-Rijsbergen1">[55]</xref>
of their orientation consistency and population consistency:</p>
<p>
<disp-formula id="pone.0070173.e045">
<graphic xlink:href="pone.0070173.e045"></graphic>
<label>(6)</label>
</disp-formula>
According to this similarity definition, chains with similar sizes and orientations are merged with priority. This merging process continues until no more chains can be merged.</p>
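<p>A hedged sketch of the merging criterion is given below. Since Eqns. 4 and 5 appear only as images, the orientation and population consistency terms are written in plausible but assumed forms (one decreasing with the included angle, the other a min/max ratio of candidate counts); the similarity itself is their harmonic mean, as stated in the text. The chain objects, with an orientation attribute and a length, are hypothetical.</p>
```python
# Hedged sketch of the chain-merging similarity (assumed consistency forms).
import numpy as np

def orientation_consistency(theta_i, theta_j, theta_max=np.pi / 8):
    angle = abs(theta_i - theta_j)              # included angle of the two chains
    angle = min(angle, np.pi - angle)
    return max(0.0, 1.0 - angle / theta_max)    # assumed form

def population_consistency(n_i, n_j):
    return min(n_i, n_j) / float(max(n_i, n_j)) # assumed form

def chain_similarity(chain_i, chain_j):
    o = orientation_consistency(chain_i.theta, chain_j.theta)
    p = population_consistency(len(chain_i), len(chain_j))
    if o + p == 0:
        return 0.0
    return 2 * o * p / (o + p)                  # harmonic mean, as in Eqn. 6
```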
<p>At last, the character candidates not belonging to any chain are discarded. The candidate chains after aggregation are shown in
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(g) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
, in which each green line represents a chain.</p>
</sec>
<sec id="s2a4">
<title>Chain analysis</title>
<p>The candidate chains formed at the previous stage might include false positives that are random combinations of scattered background clutters (such as leaves and grasses) and repeated patterns (such as bricks and windows). To eliminate these false positives, a chain level classifier is trained using the chain level features.</p>
<p>Random Forest
<xref ref-type="bibr" rid="pone.0070173-Breiman1">[53]</xref>
is again used. The chain level classifier is the second level of the two-level classification scheme. The probability of chain
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e046.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e047.jpg"></inline-graphic>
</inline-formula>
, is the fraction of votes for the positive class (text) from the trees. The chains with probabilities lower than a threshold
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e048.jpg"></inline-graphic>
</inline-formula>
are eliminated.</p>
<p>To make better decisions, the total probability of each chain is also calculated. For a chain
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e049.jpg"></inline-graphic>
</inline-formula>
with
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e050.jpg"></inline-graphic>
</inline-formula>
candidates
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e051.jpg"></inline-graphic>
</inline-formula>
, the total probability is defined as:
<disp-formula id="pone.0070173.e052">
<graphic xlink:href="pone.0070173.e052"></graphic>
<label>(7)</label>
</disp-formula>
</p>
<p>The chains whose total probabilities are lower than a threshold
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e053.jpg"></inline-graphic>
</inline-formula>
are discarded.</p>
<p>As texts of different orientations are considered, the remaining chains may be in any direction. Therefore, a candidate might belong to multiple chains. For example, in
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(h) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
the character ‘P’ in the first line is linked into three chains (note the green lines). In reality, however, a character is unlikely to belong to multiple text lines. If several chains compete for the same candidate, only the chain with the highest total probability survives (note the difference between
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(h) and (i) in
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
).</p>
<p>The survived chains are outputted by the system as detected texts (
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
(j) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
). For each detected text, its orientation is calculated through linear least squares
<xref ref-type="bibr" rid="pone.0070173-Hastie1">[54]</xref>
using the centers of the characters; its minimum area rectangle
<xref ref-type="bibr" rid="pone.0070173-Freeman1">[56]</xref>
is estimated using the orientation and the bounding boxes of the characters. Word partition, which divides text lines into separate words, is also implemented in the proposed algorithm; but it is not shown, since the general task of text detection does not require this step.</p>
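<p>A small sketch of this final geometry step, assuming OpenCV and NumPy: a linear least squares fit through the character centers gives the text line orientation, and cv2.minAreaRect over the corner points of the character bounding boxes gives the minimum area rectangle.</p>
```python
# Sketch of producing the final oriented box for a detected text line.
import cv2
import numpy as np

def text_line_geometry(centers, char_boxes):
    centers = np.asarray(centers, dtype=np.float32)
    # linear least squares fit y = a*x + b through the character centers
    # (a purely vertical line would need special handling; omitted here)
    a, b = np.polyfit(centers[:, 0], centers[:, 1], 1)
    orientation = np.arctan(a)
    # collect the corner points of all character bounding boxes
    points = np.concatenate([np.asarray(box, dtype=np.float32) for box in char_boxes])
    rect = cv2.minAreaRect(points)             # ((cx, cy), (w, h), angle)
    return orientation, rect
```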
<p>The whole algorithm described above is performed twice to handle both bright text on dark background and dark text on bright background, once along the gradient direction and once along the inverse direction. The results of two passes are fused to make final decisions. For clarity, only the results of one pass are presented.</p>
</sec>
</sec>
<sec id="s2b">
<title>Feature Design</title>
<p>We design two collections of features, component level features and chain level features, for classifying text and non-text, based on the observation that it is the median degree of regularities of text rather than particular color or shape that distinguish it from non-text, which usually has either low degree (random clutters) or high degree (repeated patterns) of regularities. At character level, the regularities of text come from nearly constant width and texturelessness of strokes, and piecewise smoothness of stroke boundaries; at line level, the regularities of text are similar colors, sizes, orientations and structures of characters, and nearly constant spacing between consecutive characters.</p>
<sec id="s2b1">
<title>Component level features</title>
<p>Inspired by Shape Context
<xref ref-type="bibr" rid="pone.0070173-Belongie1">[57]</xref>
and Feature Context
<xref ref-type="bibr" rid="pone.0070173-Wang3">[58]</xref>
, we devise two templates (
<xref ref-type="fig" rid="pone-0070173-g005">Fig. 5</xref>
(a) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
) to capture the regularities of each component at coarse and fine granularity, respectively. The radius and orientation of the templates are not fixed but adapt to the component. When computing descriptors for a component, each template is placed at the component's center and rotated to align with its major orientation; the radius is set to the characteristic scale of the component. Different cues from the sectors are encoded and concatenated into histograms. In this paper, the following cues are considered for each sector:</p>
<fig id="pone-0070173-g005" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g005</object-id>
<label>Figure 5</label>
<caption>
<title>Calculation of overlap ratio between detection rectangle and ground truth rectangle.</title>
</caption>
<graphic xlink:href="pone.0070173.g005"></graphic>
</fig>
<list list-type="bullet">
<list-item>
<p>
<bold>Contour shape </bold>
<xref ref-type="bibr" rid="pone.0070173-Gu1">[
<bold>59</bold>
]</xref>
<bold>.</bold>
Contour shape is a histogram of oriented gradients. The gradients are computed on the component contour (
<xref ref-type="fig" rid="pone-0070173-g005">Fig. 5</xref>
(c) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
).</p>
</list-item>
<list-item>
<p>
<bold>Edge shape </bold>
<xref ref-type="bibr" rid="pone.0070173-Gu1">[
<bold>59</bold>
]</xref>
<bold>.</bold>
Edge shape is also a histogram of oriented gradients; but the gradients are computed at all the pixels in the sector (
<xref ref-type="fig" rid="pone-0070173-g005">Fig. 5</xref>
(d) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
).</p>
</list-item>
<list-item>
<p>
<bold>Occupation ratio.</bold>
Occupation ratio is defined as the ratio between the number of the foreground pixels of the component within the sector and the sector area (
<xref ref-type="fig" rid="pone-0070173-g005">Fig. 5</xref>
(e) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
).</p>
</list-item>
</list>
<p>To achieve rotation invariance, the gradient orientations are rotated by an angle
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e054.jpg"></inline-graphic>
</inline-formula>
, before computing contour shape and edge shape. Then, the gradient orientations are normalized to the range
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e055.jpg"></inline-graphic>
</inline-formula>
. Six orientation bins are used for computing histograms of contour shape and edge shape, to cope with different fonts and local deformations.</p>
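<p>The following sketch illustrates one cue of a scalable rotative descriptor (an edge-shape style histogram): the template is centered on the component, its radius equals the characteristic scale, sectors are measured relative to the major orientation, and gradient orientations are rotated by that orientation before being quantized into six bins. The sector layout used here is an illustrative choice, not the paper's exact template design.</p>
```python
# Sketch of one sector-based, rotation-normalized histogram cue.
import numpy as np

def sector_descriptor(gx, gy, center, scale, theta, n_sectors=8, n_bins=6):
    h, w = gx.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dx, dy = xs - center[0], ys - center[1]
    r = np.hypot(dx, dy)
    inside = r <= scale                                   # template radius = scale
    # sector index, measured relative to the major orientation
    ang = (np.arctan2(dy, dx) - theta) % (2 * np.pi)
    sector = np.minimum((ang / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
    # rotate gradient orientations by theta, then quantize into 6 bins
    gori = (np.arctan2(gy, gx) - theta) % np.pi
    gbin = np.minimum((gori / np.pi * n_bins).astype(int), n_bins - 1)
    gmag = np.hypot(gx, gy)
    desc = np.zeros((n_sectors, n_bins))
    for s in range(n_sectors):
        m = inside & (sector == s)
        np.add.at(desc[s], gbin[m], gmag[m])              # accumulate gradient magnitude
    norm = np.linalg.norm(desc)
    return (desc / norm).ravel() if norm > 0 else desc.ravel()
```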
<p>For each cue, the signals computed in all the sectors of all the templates are concatenated to form a descriptor. We call these descriptors scalable rotative descriptors, because they are computed based on templates that are scalable and rotative. Scalable rotative descriptors are similar to PHOG
<xref ref-type="bibr" rid="pone.0070173-Bosch1">[60]</xref>
, as they both adopt spatial pyramid representation
<xref ref-type="bibr" rid="pone.0070173-Lazebnik1">[61]</xref>
. Unlike the templates used for computing PHOG, our templates are circular and their scale and orientation are adaptive to the component being described. This is the key to the scale and rotation invariance of these descriptors.</p>
<p>It is widely accepted in the community that alignment is very important for recognition and classification tasks
<xref ref-type="bibr" rid="pone.0070173-Peng1">[62]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Ruiz1">[63]</xref>
, as it can moderate or even eliminate the negative effects caused by transformations and thus lead to more robust measurements of similarity. Our strategy for computing scalable rotative descriptors, i.e. estimating the center, characteristic scale and major orientation of components and calculating features using adaptive templates, is actually a kind of implicit alignment of components. This strategy can also be generalized to multi-oriented text recognition.</p>
<p>The characteristic scale is crucial for the computation of scalable rotative descriptors, because it directly determines the scales of the templates. Too small templates may miss important information of components while too large templates may introduce noises and interferences from other components and background. The value of characteristic scale calculated using
<xref ref-type="disp-formula" rid="pone.0070173.e022">Eqn. 2</xref>
is a good trade-off in practice.</p>
<p>We have found through experiments (not shown in this paper) that using finer templates can slightly improve the performance, but will largely increase the computational burden.</p>
<p>In addition, another three types of rotation and scale invariant features are considered:</p>
<list list-type="bullet">
<list-item>
<p>
<bold>Axial ratio.</bold>
Axial ratio is computed by dividing the major axis of the component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e056.jpg"></inline-graphic>
</inline-formula>
with its minor axis:
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e057.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Width variation.</bold>
This feature is the same as defined in Sec.
<bold>Component Analysis</bold>
.</p>
</list-item>
<list-item>
<p>
<bold>Density.</bold>
The density of component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e058.jpg"></inline-graphic>
</inline-formula>
is defined as the ratio between its pixel number
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e059.jpg"></inline-graphic>
</inline-formula>
and characteristic area (here the characteristic area is
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e060.jpg"></inline-graphic>
</inline-formula>
, not the area of the bounding box):
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e061.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
</list>
</sec>
<sec id="s2b2">
<title>Chain level features</title>
<p>Eleven types of chain level features, which are not specific to rotation and scale, are designed to discriminate text lines from false positives (mostly repeated patterns and random clutters) that cannot be distinguished by the component level features.</p>
<p>For a candidate chain
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e062.jpg"></inline-graphic>
</inline-formula>
with
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e063.jpg"></inline-graphic>
</inline-formula>
(
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e064.jpg"></inline-graphic>
</inline-formula>
) candidates
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e065.jpg"></inline-graphic>
</inline-formula>
, the features are defined as below and summarized in
<xref ref-type="table" rid="pone-0070173-t001">Tab. 1:</xref>
</p>
<table-wrap id="pone-0070173-t001" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.t001</object-id>
<label>Table 1</label>
<caption>
<title>Chain level features.</title>
</caption>
<alternatives>
<graphic id="pone-0070173-t001-1" xlink:href="pone.0070173.t001"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Feature</td>
<td align="left" rowspan="1" colspan="1">Definition</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Candidate count</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e066.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average probability</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e067.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average turning angle</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e068.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Size variation</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e069.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e070.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e071.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Distance variation</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e072.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e073.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e074.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average direction bias</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e075.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average axial ratio</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e076.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average density</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e077.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average width variation</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e078.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average color self-similarity</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e079.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e080.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Average structure self-similarity</td>
<td align="left" rowspan="1" colspan="1">
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e081.jpg"></inline-graphic>
</inline-formula>
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e082.jpg"></inline-graphic>
</inline-formula>
</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<list list-type="bullet">
<list-item>
<p>
<bold>Candidate count.</bold>
This feature is adopted based on the observation that false positives usually have very few (for random clutters) or too many (for repeated patterns) candidates.</p>
</list-item>
<list-item>
<p>
<bold>Average probability.</bold>
The probabilities given by the component level classifier are reliable. This feature is the average of all the probabilities (
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e083.jpg"></inline-graphic>
</inline-formula>
) of the candidates belonging to
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e084.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Average turning angle.</bold>
Most texts appear in linear form, so for a text line the mean of the turning angles at the interior characters (
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e085.jpg"></inline-graphic>
</inline-formula>
) is very small; however, for random clutters this property will not hold.
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e086.jpg"></inline-graphic>
</inline-formula>
is the included angle between the line
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e087.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e088.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Size variation.</bold>
In most cases, characters in a text line have approximately equal sizes, but this is not the case for random clutters. The size of each component is measured by its characteristic scale
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e089.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Distance variation.</bold>
Another property of text is that characters in a text line are distributed uniformly, i.e. the distances between consecutive characters have small deviation. The distance between two consecutive components is the distance between their centers
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e090.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e091.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Average direction bias.</bold>
For most text lines, the major orientations of the characters are nearly perpendicular to the major orientation of the text line. Direction bias of component
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e092.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e093.jpg"></inline-graphic>
</inline-formula>
, is the included angle between
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e094.jpg"></inline-graphic>
</inline-formula>
and the orientation of the chain.</p>
</list-item>
<list-item>
<p>
<bold>Average axial ratio.</bold>
Some repeated patterns (e.g. barriers) that are not text consist of long, thin components; this feature helps differentiate them from true text.</p>
</list-item>
<list-item>
<p>
<bold>Average density.</bold>
Conversely, other repeated patterns (e.g. bricks) consist of short, wide components; this feature can be used to eliminate this kind of false positive.</p>
</list-item>
<list-item>
<p>
<bold>Average width variation.</bold>
False positives formed by foliage usually have varying stroke widths, while text strokes have nearly constant widths. This feature is defined as the mean of the width variation values of all the candidates.</p>
</list-item>
<list-item>
<p>
<bold>Average color self-similarity.</bold>
Characters in a text line usually have similar but not identical color distributions; in false positive chains, by contrast, the color self-similarities
<xref ref-type="bibr" rid="pone.0070173-Shechtman1">[64]</xref>
of the candidates are either too high (repeated patterns) or too low (random clutters). A sketch of this computation is given after the list. The color similarity
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e095.jpg"></inline-graphic>
</inline-formula>
is defined as the cosine similarity of the color histograms of the two candidates
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e096.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e097.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
<list-item>
<p>
<bold>Average structure self-similarity.</bold>
Likewise, characters in a text line have similar structures, while false positives usually have either almost identical structures (repeated patterns) or diverse structures (random clutters). The structure similarity
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e098.jpg"></inline-graphic>
</inline-formula>
is defined as the cosine similarity of the edge shape descriptors of the two components
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e099.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e100.jpg"></inline-graphic>
</inline-formula>
.</p>
</list-item>
</list>
</sec>
</sec>
</sec>
<sec id="s3">
<title>Dataset and Evaluation Protocol</title>
<p>In this section, we introduce a large dataset for evaluating text detection algorithms, which contains 500 natural images with real-world complexity. In addition, a new evaluation methodology which is suitable for benchmarking algorithms designed for texts of arbitrary directions is proposed.</p>
<sec id="s3a">
<title>Dataset</title>
<p>Although widely used in the community, the ICDAR datasets
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
have two major drawbacks. First, most of the text lines (or single characters) in the ICDAR datasets are horizontal. In real scenarios, however, text may appear in any orientation. The second drawback is that all the text lines and characters in these datasets are in English. Therefore, these datasets cannot be used to assess detection systems designed for multilingual scripts.</p>
<p>These two shortcomings have been pointed out in
<xref ref-type="bibr" rid="pone.0070173-Pan1">[13]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
. Two separate datasets are therefore created: one contains non-horizontal text lines
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
and the other one is a multilingual dataset
<xref ref-type="bibr" rid="pone.0070173-Pan1">[13]</xref>
. In this work, we generate a new multilingual text image dataset with horizontal as well as slanted and skewed texts. We name this dataset the MSRA Text Detection 500 Database (MSRA-TD500) because it includes 500 natural images in total. These images were taken from indoor (office and mall) and outdoor (street) scenes using a pocket camera. The indoor images are mainly signs, doorplates and caution plates, while the outdoor images are mostly guide boards and billboards in complex backgrounds. The resolutions of the images vary from 1296×864 to 1920×1280. This dataset is available at
<ext-link ext-link-type="uri" xlink:href="http://www.loni.ucla.edu/~ztu/publication/MSRA-TD500.zip">http://www.loni.ucla.edu/~ztu/publication/MSRA-TD500.zip</ext-link>
.</p>
<p>MSRA-TD500 is very challenging because of both the diversity of the texts and the complexity of the backgrounds in the images. The texts may be in different languages (Chinese, English and mixture of both), fonts, sizes, colors and orientations. The backgrounds may contain vegetation (e.g. trees and grasses) and repeated patterns (e.g. windows and bricks), which are not so distinguishable from text.</p>
<p>Some typical images from this dataset are shown in
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
. It is worth mentioning that even though the purpose of this dataset is to evaluate text detection algorithms designed for multi-oriented texts, horizontal and near-horizontal texts still dominate the dataset (about 2/3) because these are the most common cases in practice.</p>
<p>The dataset is divided into two parts: training set and test set. The training set contains 300 images randomly selected from the original dataset and the rest 200 images constitute the test set. All the images in this dataset are fully annotated. The basic unit in this dataset is text line rather than word, which is used in the ICDAR dataset, because it is hard to partition Chinese text lines into individual words based on their spacings; even for English text lines, it is non-trivial to perform word partition without high level information. The procedure of ground truth generation is shown in
<xref ref-type="fig" rid="pone-0070173-g004">Fig. 4</xref>
.</p>
</sec>
<sec id="s3b">
<title>Evaluation Protocol</title>
<p>Before presenting our novel evaluation protocol for text detection, we first introduce the evaluation method used in the ICDAR competitions
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Lucas2">[48]</xref>
as background. Under the ICDAR evaluation protocol, the performance of an algorithm is measured by F-measure, which is the harmonic mean of precision and recall. Different from the standard information retrieval measures of precision and recall, more flexible definitions are adopted in the ICDAR competitions
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Lucas2">[48]</xref>
. The match
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e101.jpg"></inline-graphic>
</inline-formula>
between two rectangles is defined as the ratio between the area of their intersection and that of the minimum bounding rectangle containing both rectangles. The set of rectangles estimated by each algorithm is called
<italic>estimates</italic>
while the set of ground truth rectangles provided in the ICDAR dataset is called
<italic>targets</italic>
. For each rectangle, the match with the largest value is found. Hence, the best match for a rectangle
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e102.jpg"></inline-graphic>
</inline-formula>
in a set of rectangles
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e103.jpg"></inline-graphic>
</inline-formula>
is:
<disp-formula id="pone.0070173.e104">
<graphic xlink:href="pone.0070173.e104"></graphic>
<label>(8)</label>
</disp-formula>
</p>
<p>Then, the definitions of precision and recall are:
<disp-formula id="pone.0070173.e105">
<graphic xlink:href="pone.0070173.e105"></graphic>
<label>(9)</label>
</disp-formula>
<disp-formula id="pone.0070173.e106">
<graphic xlink:href="pone.0070173.e106"></graphic>
<label>(10)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e107.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e108.jpg"></inline-graphic>
</inline-formula>
are the sets of ground truth rectangles and estimated rectangles, respectively. The F-measure, which is a single measure of algorithm performance, is a combination of the two above measures. The relative weights of precision and recall are controlled by a parameter
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e109.jpg"></inline-graphic>
</inline-formula>
, which is set to 0.5 to give equal weights to precision and recall:</p>
<p>
<disp-formula id="pone.0070173.e110">
<graphic xlink:href="pone.0070173.e110"></graphic>
<label>(11)</label>
</disp-formula>
Minimum area rectangles
<xref ref-type="bibr" rid="pone.0070173-Freeman1">[56]</xref>
are used in our protocol because they (green rectangle in
<xref ref-type="fig" rid="pone-0070173-g004">Fig. 4</xref>
(b)) are much tighter and more accurate than axis-aligned rectangles (red rectangle in
<xref ref-type="fig" rid="pone-0070173-g004">Fig. 4</xref>
(b)). However, a problem introduced by using minimum area rectangles is that it becomes difficult to judge whether a text line is correctly detected. As shown in
<xref ref-type="fig" rid="pone-0070173-g005">Fig. 5</xref>
, it is not trivial to directly compute the overlap ratio between the estimated rectangle
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e111.jpg"></inline-graphic>
</inline-formula>
and the ground truth rectangle
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e112.jpg"></inline-graphic>
</inline-formula>
. Instead, we calculate the overlap ratio using axis-aligned rectangles
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e113.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e114.jpg"></inline-graphic>
</inline-formula>
, which are obtained by rotating
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e115.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e116.jpg"></inline-graphic>
</inline-formula>
around their centers
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e117.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e118.jpg"></inline-graphic>
</inline-formula>
, respectively. The overlap ratio between
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e119.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e120.jpg"></inline-graphic>
</inline-formula>
is defined as:
<disp-formula id="pone.0070173.e121">
<graphic xlink:href="pone.0070173.e121"></graphic>
<label>(12)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e122.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e123.jpg"></inline-graphic>
</inline-formula>
denote the areas of the intersection and union of
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e124.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e125.jpg"></inline-graphic>
</inline-formula>
. Obviously, the overlap ratio computed in this way is not exact. Besides, the annotated ground truth rectangles are not exact either, especially when the texts are skewed. Because of the imprecision of both the ground truth and the computed overlap ratio, the definitions of precision and recall used in the ICDAR protocol do not apply; we therefore return to their original definitions.</p>
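<p>The overlap ratio of Eqn. 12 can be sketched as follows: each minimum area rectangle, represented as (center, size, angle), is replaced by the axis-aligned rectangle with the same center and size, and the intersection-over-union of the two axis-aligned boxes is returned.</p>
```python
# Sketch of the overlap ratio between two minimum area rectangles (Eqn. 12).
def axis_aligned(rect):
    (cx, cy), (w, h), _angle = rect        # drop the angle: rotate to axis-aligned
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

def overlap_ratio(det_rect, gt_rect):
    ax1, ay1, ax2, ay2 = axis_aligned(det_rect)
    bx1, by1, bx2, by2 = axis_aligned(gt_rect)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```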
<p>Similar to the evaluation method for the PASCAL object detection task
<xref ref-type="bibr" rid="pone.0070173-Everingham1">[65]</xref>
, in our protocol detections are considered true or false positives based on the overlap ratio between the estimated minimum area rectangles and the ground truth rectangles. If the included angle of the estimated rectangle and the ground truth rectangle is less than
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e126.jpg"></inline-graphic>
</inline-formula>
and their overlap ratio exceeds 0.5, the estimated rectangle is considered a correct detection. Multiple detections of the same text line are taken as false positives. The definitions of precision and recall are:
<disp-formula id="pone.0070173.e127">
<graphic xlink:href="pone.0070173.e127"></graphic>
<label>(13)</label>
</disp-formula>
<disp-formula id="pone.0070173.e128">
<graphic xlink:href="pone.0070173.e128"></graphic>
<label>(14)</label>
</disp-formula>
where
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e129.jpg"></inline-graphic>
</inline-formula>
is the set of true positive detections while
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e130.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e131.jpg"></inline-graphic>
</inline-formula>
are the sets of estimated rectangles and ground truth rectangles.</p>
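<p>A sketch of the resulting matching and counting rules is given below; it reuses the overlap_ratio function sketched above. The angle threshold appears in the text only as an inline image, so its default value here is an assumption, and duplicate detections of an already matched ground truth rectangle are simply left uncounted, which makes them false positives in the precision denominator.</p>
```python
# Sketch of the basic (non-elastic) true-positive decision and counting rules.
import numpy as np

def evaluate(detections, ground_truths, angle_thresh=np.pi / 8):
    # rectangles are ((cx, cy), (w, h), angle_in_degrees) tuples
    matched_gt, true_positives = set(), 0
    for det in detections:
        for i, gt in enumerate(ground_truths):
            d_angle = abs(det[2] - gt[2]) * np.pi / 180.0   # angle difference, radians
            if i not in matched_gt and d_angle < angle_thresh \
                    and overlap_ratio(det, gt) > 0.5:
                matched_gt.add(i)
                true_positives += 1
                break
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```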
<p>Moreover, to accommodate difficult texts (too small, occluded, blurry, or truncated) that are hard for text detection algorithms, we introduce an elastic mechanism which can tolerate detection misses of difficult texts. The basic criterion of this elastic mechanism is:
<italic>if the difficult texts are detected by an algorithm, it counts; otherwise, the algorithm will not be punished</italic>
. Accordingly, the annotations of the images in the proposed dataset are adjusted. Each text line considered to be difficult is given an additional “difficult” label (
<xref ref-type="fig" rid="pone-0070173-g003">Fig. 3</xref>
). Thus the ground truth rectangles can be categorized into two subsets: the ordinary subset
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e132.jpg"></inline-graphic>
</inline-formula>
and the difficult subset
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e133.jpg"></inline-graphic>
</inline-formula>
; likewise, the true positives
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e134.jpg"></inline-graphic>
</inline-formula>
can also be categorized into the ordinary subset
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e135.jpg"></inline-graphic>
</inline-formula>
, which is the set of rectangles matched with
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e136.jpg"></inline-graphic>
</inline-formula>
, and the difficult subset
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e137.jpg"></inline-graphic>
</inline-formula>
, which is the set of rectangles matched with
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e138.jpg"></inline-graphic>
</inline-formula>
. After incorporating the elastic mechanism, the definitions of precision and recall become:
<disp-formula id="pone.0070173.e139">
<graphic xlink:href="pone.0070173.e139"></graphic>
<label>(15)</label>
</disp-formula>
<disp-formula id="pone.0070173.e140">
<graphic xlink:href="pone.0070173.e140"></graphic>
<label>(16)</label>
</disp-formula>
</p>
</sec>
</sec>
<sec id="s4">
<title>Experiments and Discussions</title>
<p>We have implemented the proposed algorithm in C++ and have evaluated it on a common server (2.53 GHz CPU, 48G RAM and Windows 64-bit OS). 200 trees are used for training the component level classifier and 100 trees for the chain level classifier. The threshold values are:
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e141.jpg"></inline-graphic>
</inline-formula>
,
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e142.jpg"></inline-graphic>
</inline-formula>
and
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e143.jpg"></inline-graphic>
</inline-formula>
. We have found empirically that the text detectors under this parameter setting work well for all the datasets used in this paper.</p>
<sec id="s4a">
<title>Results on Horizontal Texts</title>
<p>In order to compare the proposed algorithm with existing methods designed for horizontal texts, we have evaluated the algorithm on the standard dataset used in the ICDAR 2003 Rubust Reading Competition
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
and the ICDAR 2005 Text Locating Competition
<xref ref-type="bibr" rid="pone.0070173-Lucas2">[48]</xref>
. This dataset contains 509 fully annotated text images. 258 images from the dataset are used for training and 251 for testing. We train a text detector (denoted by TD-ICDAR) on the training images.</p>
<p>Some detected texts of the proposed algorithm are presented in Fig. 7 of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
. Our algorithm can handle several types of challenges, e.g. variations in text font, color and size, as well as repeated patterns and background clutters. The quantitative comparison of different methods evaluated on the ICDAR test set is shown in
<xref ref-type="table" rid="pone-0070173-t002">Tab. 2</xref>
of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
. As can be seen, our method compares favorably with the state-of-the-art when dealing with horizontal texts.</p>
<table-wrap id="pone-0070173-t002" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.t002</object-id>
<label>Table 2</label>
<caption>
<title>Performances of different text detection methods evaluated on the ICDAR 2011 dataset
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
.</title>
</caption>
<alternatives>
<graphic id="pone-0070173-t002-2" xlink:href="pone.0070173.t002"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Algorithm</td>
<td align="left" rowspan="1" colspan="1">Precision</td>
<td align="left" rowspan="1" colspan="1">Recall</td>
<td align="left" rowspan="1" colspan="1">F-measure</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">TD-ICDAR2011</td>
<td align="left" rowspan="1" colspan="1">0.7215</td>
<td align="left" rowspan="1" colspan="1">0.5952</td>
<td align="left" rowspan="1" colspan="1">0.6523</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Kim
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.8298</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.6247</bold>
</td>
<td align="left" rowspan="1" colspan="1">
<bold>0.7128</bold>
</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Yi
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.6722</td>
<td align="left" rowspan="1" colspan="1">0.5809</td>
<td align="left" rowspan="1" colspan="1">0.6232</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Yang
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.6697</td>
<td align="left" rowspan="1" colspan="1">0.5768</td>
<td align="left" rowspan="1" colspan="1">0.6198</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Neumann
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.6893</td>
<td align="left" rowspan="1" colspan="1">0.5254</td>
<td align="left" rowspan="1" colspan="1">0.5963</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Shao
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.6352</td>
<td align="left" rowspan="1" colspan="1">0.5352</td>
<td align="left" rowspan="1" colspan="1">0.5809</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Guyomard
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.6297</td>
<td align="left" rowspan="1" colspan="1">0.5007</td>
<td align="left" rowspan="1" colspan="1">0.5578</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Lee
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.5967</td>
<td align="left" rowspan="1" colspan="1">0.4457</td>
<td align="left" rowspan="1" colspan="1">0.5103</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Sun
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.3501</td>
<td align="left" rowspan="1" colspan="1">0.3832</td>
<td align="left" rowspan="1" colspan="1">0.3659</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Hanif
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.5505</td>
<td align="left" rowspan="1" colspan="1">0.2596</td>
<td align="left" rowspan="1" colspan="1">0.3419</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>It is noted that existing algorithms seem to converge in performance (with F-measure around 0.66) on the ICDAR dataset. This might be due to three reasons: (1) the ICDAR evaluation method is different from the conventional methods for object detection (e.g. the PASCAL evaluation method
<xref ref-type="bibr" rid="pone.0070173-Everingham1">[65]</xref>
). The ICDAR evaluation method actually requires pixel-level accuracy (see
<xref ref-type="disp-formula" rid="pone.0070173.e105">Eqn. 9</xref>
and
<xref ref-type="disp-formula" rid="pone.0070173.e106">Eqn. 10</xref>
), which is demanding for detection algorithms, considering that the ground truth is given in the form of rough rectangles. (2) The ICDAR evaluation method requires word partition, that is, dividing text lines into individual words. This also limits the scores of text detection algorithms, because it is non-trivial to perform word partition without high-level information. Moreover, the definitions of “word” are not consistent among different images. (3) Most algorithms assume that a word or text line in the image consists of at least two characters. However, in the ICDAR dataset some images contain single characters. In these images, most existing algorithms will fail to detect the single characters.</p>
<p>The ICDAR 2011 Robust Reading Competition Challenge 2
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
was held to track the recent progress in the field of scene text detection and recognition. Due to the problems with the dataset used in the previous ICDAR competitions (for example, imprecise bounding boxes and inconsistent definitions of “word”), the dataset in the ICDAR 2011 competition was extended and relabeled
<xref ref-type="bibr" rid="pone.0070173-Shahab1">[49]</xref>
. Moreover, the evaluation method proposed by Wolf
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Wolf1">[66]</xref>
is adopted as the standard for performance evaluation, to replace the previous evaluation protocol, which is unable to handle the cases of one-to-many and many-to-many matches and thus consistently underestimates the capability of text detection algorithms.</p>
<p>To enable fair comparison, we have also trained a text detector (denoted by TD-ICDAR2011) using the training set of the ICDAR 2011 competition dataset, performed text detection on the test images and measured the performance using the method of Wolf
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Wolf1">[66]</xref>
.
<xref ref-type="fig" rid="pone-0070173-g006">Fig. 6</xref>
illustrates several detection examples of our method on this dataset. The quantitative results of different text detection methods on the ICDAR 2011 dataset are shown in
<xref ref-type="table" rid="pone-0070173-t002">Tab. 2</xref>
. The proposed algorithm achieves the second highest F-measure on this dataset.</p>
<fig id="pone-0070173-g006" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g006</object-id>
<label>Figure 6</label>
<caption>
<title>Detected texts in images from the ICDAR 2011 test set.</title>
</caption>
<graphic xlink:href="pone.0070173.g006"></graphic>
</fig>
</sec>
<sec id="s4b">
<title>Results on Multi-oriented Texts</title>
<p>We have also trained a text detector (denoted by TD-MSRA) on mixture of the training set of the proposed dataset and that of ICDAR and compared it to the systems of Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
and Chen
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
. The executables of these systems were obtained from the authors. Detection examples of the proposed algorithm on this dataset are shown in
<xref ref-type="fig" rid="pone-0070173-g008">Fig. 8</xref>
(a) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
. The proposed algorithm is able to detect texts of large variation in natural scenes, e.g., skewed and curved text. The images in the last row of
<xref ref-type="fig" rid="pone-0070173-g008">Fig. 8</xref>
(a) of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
are some typical cases where our algorithm failed to detect the texts or gave false positives. The misses (pink rectangles) are mainly due to strong highlights, blur and low resolution; the false positives (red rectangles) are usually caused by elements that closely resemble text, such as windows, trees, and signs.</p>
<p>The performances are measured using the proposed evaluation protocol and shown in
<xref ref-type="table" rid="pone-0070173-t003">Tab. 3</xref>
of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
. Compared with the competing algorithms, the proposed method achieves significantly enhanced performance when detecting texts of different orientations. The performances of the other competing algorithms are not presented because their code or executables are unavailable. The average processing time of our algorithm on this dataset is 7.2 s, and that of Epshtein
<italic>et al.</italic>
is 6 s. Our algorithm is a bit slower, but with the advantage of being able to detect multi-oriented texts.</p>
<table-wrap id="pone-0070173-t003" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.t003</object-id>
<label>Table 3</label>
<caption>
<title>Performances of different text detection methods evaluated on texts of different languages.</title>
</caption>
<alternatives>
<graphic id="pone-0070173-t003-3" xlink:href="pone.0070173.t003"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">Algorithm</td>
<td align="left" rowspan="1" colspan="1">Precision</td>
<td align="left" rowspan="1" colspan="1">Recall</td>
<td align="left" rowspan="1" colspan="1">F-measure</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">TD-MSRA</td>
<td align="left" rowspan="1" colspan="1">0.73</td>
<td align="left" rowspan="1" colspan="1">0.64</td>
<td align="left" rowspan="1" colspan="1">0.66</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.58</td>
<td align="left" rowspan="1" colspan="1">0.65</td>
<td align="left" rowspan="1" colspan="1">0.59</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Chen
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.06</td>
<td align="left" rowspan="1" colspan="1">0.08</td>
<td align="left" rowspan="1" colspan="1">0.07</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>In
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
, a dataset called the Oriented Scene Text Database (OSTD), which contains texts of various orientations, was released. This dataset includes 89 images of logos, indoor scenes, and street views. We perform text detection on all the images in this dataset. The quantitative results are presented in
<xref ref-type="table" rid="pone-0070173-t004">Tab. 4</xref>
of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
. Our method outperforms
<xref ref-type="bibr" rid="pone.0070173-Yi1">[14]</xref>
on OSTD, with an improvement of 0.17 in F-measure.</p>
<table-wrap id="pone-0070173-t004" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.t004</object-id>
<label>Table 4</label>
<caption>
<title>End-to-end scene text recognition performances.</title>
</caption>
<alternatives>
<graphic id="pone-0070173-t004-4" xlink:href="pone.0070173.t004"></graphic>
<table frame="hsides" rules="groups">
<colgroup span="1">
<col align="left" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
<col align="center" span="1"></col>
</colgroup>
<thead>
<tr>
<td align="left" rowspan="1" colspan="1">System</td>
<td align="left" rowspan="1" colspan="1">Precision</td>
<td align="left" rowspan="1" colspan="1">Recall</td>
<td align="left" rowspan="1" colspan="1">F-measure</td>
</tr>
</thead>
<tbody>
<tr>
<td align="left" rowspan="1" colspan="1">Ours</td>
<td align="left" rowspan="1" colspan="1">0.58</td>
<td align="left" rowspan="1" colspan="1">0.51</td>
<td align="left" rowspan="1" colspan="1">0.53</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
</td>
<td align="left" rowspan="1" colspan="1">0.57</td>
<td align="left" rowspan="1" colspan="1">0.49</td>
<td align="left" rowspan="1" colspan="1">0.51</td>
</tr>
<tr>
<td align="left" rowspan="1" colspan="1">Direct OCR</td>
<td align="left" rowspan="1" colspan="1">0.13</td>
<td align="left" rowspan="1" colspan="1">0.10</td>
<td align="left" rowspan="1" colspan="1">0.11</td>
</tr>
</tbody>
</table>
</alternatives>
</table-wrap>
<p>From
<xref ref-type="table" rid="pone-0070173-t003">Tab. 3</xref>
and
<xref ref-type="table" rid="pone-0070173-t004">4</xref>
of
<xref ref-type="bibr" rid="pone.0070173-Yao1">[23]</xref>
, we observe that even TD-ICDAR (trained only on horizontal texts) achieves much better performance than the other methods on non-horizontal texts. This demonstrates the effectiveness of the proposed rotation-invariant features.</p>
</sec>
<sec id="s4c">
<title>Results on Texts of Different Languages</title>
<p>To further verify the ability of the proposed algorithm to detect texts of different languages, we have collected a multilingual text image database (to be made available at
<ext-link ext-link-type="uri" xlink:href="http://www.loni.ucla.edu/~ztu/publication/">http://www.loni.ucla.edu/~ztu/publication/</ext-link>
) from the Internet. The database contains 94 natural images with texts in various languages, both Eastern and Western, such as Japanese, Korean, Arabic, Greek, and Russian. We apply TD-MSRA to all the images in this database.
<xref ref-type="fig" rid="pone-0070173-g007">Fig. 7</xref>
shows some detected texts in images from this database. The algorithms of Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
and Chen
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Chen1">[7]</xref>
are adopted as baselines. The quantitative results of these algorithms are presented in
<xref ref-type="table" rid="pone-0070173-t003">Tab. 3</xref>
. The proposed algorithm and the method of Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
both give excellent performance on this benchmark. Though TD-MSRA is trained only on Chinese and English texts, it generalizes readily to texts in other languages. This indicates that the proposed algorithm is quite general and can serve as a multilingual text detector.</p>
<fig id="pone-0070173-g007" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g007</object-id>
<label>Figure 7</label>
<caption>
<title>Detected texts in various languages.</title>
<p>The images are collected from the Internet.</p>
</caption>
<graphic xlink:href="pone.0070173.g007"></graphic>
</fig>
</sec>
<sec id="s4d">
<title>Special Consideration on Single Characters</title>
<p>Most existing algorithms cannot handle single characters, since they assume that a word or text line in the image consists of at least two characters. To overcome this limitation, we have modified the proposed algorithm to handle single characters. In the candidate linking stage, we no longer simply discard all single character candidates; instead, we retain the character candidates with high probabilities (
<inline-formula>
<inline-graphic xlink:href="pone.0070173.e144.jpg"></inline-graphic>
</inline-formula>
), even if they do not belong to any chain. After this modification, the proposed algorithm is able to detect obvious single characters in natural images.
<xref ref-type="fig" rid="pone-0070173-g008">Fig. 8</xref>
depicts some single characters detected by the proposed algorithm.</p>
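As a hedged illustration of the modification just described, the Python snippet below keeps a candidate either because it belongs to a chain or because its character probability is high; the probability field and the 0.9 threshold are hypothetical placeholders, not the exact quantities used in the paper.

HIGH_PROB = 0.9  # hypothetical confidence threshold, not the paper's value

def retain_candidates(candidates, chains, high_prob=HIGH_PROB):
    # candidates: list of dicts with a 'prob' entry (character-level confidence)
    # chains: list of lists of candidate indices produced by the linking stage
    in_chain = {i for chain in chains for i in chain}
    kept = []
    for i, cand in enumerate(candidates):
        if i in in_chain:
            kept.append(cand)            # member of a word/line chain
        elif cand["prob"] >= high_prob:
            kept.append(cand)            # confident single character, retained after the modification
    return kept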
<fig id="pone-0070173-g008" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g008</object-id>
<label>Figure 8</label>
<caption>
<title>Detected single characters in images.</title>
<p>Images are from the ICDAR dataset
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Lucas2">[48]</xref>
.</p>
</caption>
<graphic xlink:href="pone.0070173.g008"></graphic>
</fig>
<p>To assess the effectiveness of the proposed strategy for single character detection, we have conducted an additional experiment. The algorithm is applied to the images containing single characters from the ICDAR dataset
<xref ref-type="bibr" rid="pone.0070173-Lucas1">[47]</xref>
,
<xref ref-type="bibr" rid="pone.0070173-Lucas2">[48]</xref>
, with and without single character detection. Without single character detection, the algorithm achieves precision = 0.56, recall = 0.28 and F-measure = 0.36; with single character detection, the algorithm achieves precision = 0.62, recall = 0.40 and F-measure = 0.47. The performance is significantly improved after enabling single character detection.</p>
</sec>
<sec id="s4e">
<title>End-to-End Scene Text Recognition</title>
<p>As can be seen from the previous experiments, the proposed text detection algorithm works very well under fairly broad realistic conditions. Thus, one could combine it with any existing Optical Character Recognition (OCR) engine to build an end-to-end recognition system for multi-oriented text. A possible pipeline of such a system is illustrated in
<xref ref-type="fig" rid="pone-0070173-g009">Fig. 9:</xref>
We first apply our text detection algorithm to the original image. If the detected text regions exhibit significant deformation, we rectify them using the low-rank-structure-based rectification technique TILT
<xref ref-type="bibr" rid="pone.0070173-Zhang1">[67]</xref>
. Next, we binarize the text regions with adaptive thresholding and feed the binary images into an off-the-shelf OCR software
<xref ref-type="bibr" rid="pone.0070173-Wintone1">[68]</xref>
to produce the final recognition result.</p>
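A hedged Python sketch of such a pipeline is given below. It substitutes commonly available tools for the components named in the text: the detector is an assumed callable detect_text() returning oriented boxes, rectification is reduced to a simple rotation (the paper uses TILT [67]), and pytesseract stands in for the TH-OCR engine [68]; none of this is the authors' implementation.

import cv2
import pytesseract

def recognize(image_path, detect_text):
    # detect_text(img) is a hypothetical detector returning (cx, cy, w, h, angle_deg) boxes.
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    results = []
    for (cx, cy, w, h, angle_deg) in detect_text(img):
        # Rotate the region upright, then crop an axis-aligned patch around it.
        M = cv2.getRotationMatrix2D((cx, cy), angle_deg, 1.0)
        rotated = cv2.warpAffine(gray, M, (gray.shape[1], gray.shape[0]))
        x0, y0 = int(cx - w / 2), int(cy - h / 2)
        patch = rotated[max(y0, 0):y0 + int(h), max(x0, 0):x0 + int(w)]
        if patch.size == 0:
            continue
        # Adaptive thresholding before OCR, as in the pipeline description.
        binary = cv2.adaptiveThreshold(patch, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                       cv2.THRESH_BINARY, 31, 10)
        results.append(pytesseract.image_to_string(binary))
    return results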
<fig id="pone-0070173-g009" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g009</object-id>
<label>Figure 9</label>
<caption>
<title>Pipeline of our end-to-end scene text recognition system.</title>
</caption>
<graphic xlink:href="pone.0070173.g009"></graphic>
</fig>
<p>Since there is no standard benchmark for multi-oriented English text recognition (the NEOCR dataset
<xref ref-type="bibr" rid="pone.0070173-Nagy1">[69]</xref>
includes images with multi-oriented texts in natural scenes, but its texts are in languages such as Hungarian, Russian, Turkish, and Czech, which our end-to-end recognition system does not currently support), we collect a dataset (to be made available at
<ext-link ext-link-type="uri" xlink:href="http://www.loni.ucla.edu/~ztu/publication/">http://www.loni.ucla.edu/~ztu/publication/</ext-link>
) of 80 natural images with slanted and skewed English texts and Arabic numerals to evaluate the proposed end-to-end text recognition system. The majority of the images are from the MSRA-TD500 database; the rest are from the Internet.
<xref ref-type="fig" rid="pone-0070173-g010">Fig. 10</xref>
shows several typical images from this database.</p>
<fig id="pone-0070173-g010" orientation="portrait" position="float">
<object-id pub-id-type="doi">10.1371/journal.pone.0070173.g010</object-id>
<label>Figure 10</label>
<caption>
<title>Examples of the images collected for end-to-end scene text recognition.</title>
</caption>
<graphic xlink:href="pone.0070173.g010"></graphic>
</fig>
<p>For comparison, we have tested the end-to-end text recognition system of Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
on this dataset. To demonstrate how text detection can help to effectively extract text information from natural images, we have also performed character recognition directly on the original images (denoted by Direct OCR). The quantitative performances are computed at the character level and shown in
<xref ref-type="table" rid="pone-0070173-t004">Tab. 4</xref>
. As can be seen, applying OCR directly to natural images gives very poor performance, because of the variations of texts and background clutter. In contrast, both our scene text recognition system and that of Epshtein
<italic>et al.</italic>
<xref ref-type="bibr" rid="pone.0070173-Epshtein1">[9]</xref>
achieve much higher performance. This suggests that text detection is a crucial step when extracting text information from natural images.</p>
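For completeness, a minimal Python sketch of one possible character-level scoring scheme is shown below; it counts the multiset overlap of characters between the recognized string and the ground truth. The paper does not spell out its exact matching rule, so this is an assumption, not the measure actually used for Tab. 4.

from collections import Counter

def char_level_prf(recognized, ground_truth):
    # Multiset overlap of non-space characters; an illustrative scheme only.
    rec = Counter(recognized.replace(" ", ""))
    gt = Counter(ground_truth.replace(" ", ""))
    matched = sum((rec & gt).values())
    precision = matched / sum(rec.values()) if rec else 0.0
    recall = matched / sum(gt.values()) if gt else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f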
<p>We have also examined why, in this experiment, the improvement of our system over that of Epshtein <italic>et al.</italic> is not as dramatic as in the previous pure detection experiments. The main reason is that, although our system can detect more texts, some of the fonts in these natural images cannot be handled well by the current OCR system. This suggests that, to build a truly high-performance recognition system for texts in natural images, significant challenges remain in the text recognition component, especially in recognizing texts with more diverse fonts, sizes, and orientations. From our observation and preliminary study, some of the discriminative features that we have extracted for detection purposes can be very useful for subsequent text recognition as well. We leave a more careful study of a unified text detection and recognition system for future work.</p>
</sec>
<sec id="s4f">
<title>Conclusions</title>
<p>We have presented a text detection system that is capable of detecting texts of varying orientations in complex natural scenes. Our system compares favorably with the state-of-the-art algorithms when handling horizontal texts and achieves significantly enhanced performance on multi-oriented texts. Furthermore, we have proposed a multilingual database with horizontal as well as non-horizontal texts and specifically designed an evaluation protocol for benchmarking algorithms on multi-oriented texts.</p>
<p>The component-level features are in fact character descriptors that can distinguish among different characters, so they can also be adopted for character recognition. We plan to exploit this property and develop a unified framework for text detection and character recognition in the future.</p>
</sec>
</sec>
</body>
<back>
<ack>
<p>The authors would like to thank A. L. Yuille and E. Ofek for providing their executables. Our thanks also go to T. Mei and X. Lin for their help in developing the annotation tool and labelling the ground truth rectangles.</p>
</ack>
<ref-list>
<title>References</title>
<ref id="pone.0070173-DeSouza1">
<label>1</label>
<mixed-citation publication-type="journal">
<name>
<surname>DeSouza</surname>
<given-names>GN</given-names>
</name>
,
<name>
<surname>Kak</surname>
<given-names>AC</given-names>
</name>
(
<year>2002</year>
)
<article-title>Vision for mobile robot navigation: A survey</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>24</volume>
:
<fpage>237</fpage>
<lpage>267</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Tsai1">
<label>2</label>
<mixed-citation publication-type="other">Tsai S, Chen H, Chen D, Schroth G, Grzeszczuk R,
<etal>et al</etal>
. (2011) Mobile visual search on printed documents using text and low bit-rate features. In: Proc. of ICIP.</mixed-citation>
</ref>
<ref id="pone.0070173-Kisacanin1">
<label>3</label>
<mixed-citation publication-type="other">Kisacanin B, Pavlovic V, Huang TS (2005) Real-time vision for human-computer interaction. Springer.</mixed-citation>
</ref>
<ref id="pone.0070173-Jain1">
<label>4</label>
<mixed-citation publication-type="journal">
<name>
<surname>Jain</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Yu</surname>
<given-names>B</given-names>
</name>
(
<year>1998</year>
)
<article-title>Automatic text location in images and video frames</article-title>
.
<source>Pattern Recognition</source>
<volume>31</volume>
:
<fpage>2055</fpage>
<lpage>2076</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Hasan1">
<label>5</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hasan</surname>
<given-names>YMY</given-names>
</name>
,
<name>
<surname>Karam</surname>
<given-names>LJ</given-names>
</name>
(
<year>2000</year>
)
<article-title>Morphological text extraction from images</article-title>
.
<source>IEEE Trans Image Processing</source>
<volume>9</volume>
:
<fpage>1978</fpage>
<lpage>1983</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Kim1">
<label>6</label>
<mixed-citation publication-type="journal">
<name>
<surname>Kim</surname>
<given-names>KI</given-names>
</name>
,
<name>
<surname>Jung</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Kim</surname>
<given-names>JH</given-names>
</name>
(
<year>2003</year>
)
<article-title>Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>25</volume>
:
<fpage>1631</fpage>
<lpage>1639</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Chen1">
<label>7</label>
<mixed-citation publication-type="other">Chen X, Yuille A (2004) Detecting and reading text in natural scenes. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Chen2">
<label>8</label>
<mixed-citation publication-type="journal">
<name>
<surname>Chen</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Odobez</surname>
<given-names>JM</given-names>
</name>
,
<name>
<surname>Bourlard</surname>
<given-names>H</given-names>
</name>
(
<year>2004</year>
)
<article-title>Text detection and recognition in images and video frames</article-title>
.
<source>Pattern Recognition</source>
<volume>37</volume>
:
<fpage>595</fpage>
<lpage>608</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Epshtein1">
<label>9</label>
<mixed-citation publication-type="other">Epshtein B, Ofek E, Wexler Y (2010) Detecting text in natural scenes with stroke width transform. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Wang1">
<label>10</label>
<mixed-citation publication-type="other">Wang K, Belongie S (2010) Word spotting in the wild. In: Proc. of ECCV.</mixed-citation>
</ref>
<ref id="pone.0070173-Neumann1">
<label>11</label>
<mixed-citation publication-type="other">Neumann L, Matas J (2010) A method for text localization and recognition in real-world images. In: Proc. of ACCV.</mixed-citation>
</ref>
<ref id="pone.0070173-Zhao1">
<label>12</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zhao</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>ST</given-names>
</name>
,
<name>
<surname>Kwok</surname>
<given-names>J</given-names>
</name>
(
<year>2010</year>
)
<article-title>Text detection in images using sparse representation with discriminative dictionaries</article-title>
.
<source>IVC</source>
<volume>28</volume>
:
<fpage>1590</fpage>
<lpage>1599</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Pan1">
<label>13</label>
<mixed-citation publication-type="journal">
<name>
<surname>Pan</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Hou</surname>
<given-names>X</given-names>
</name>
,
<name>
<surname>Liu</surname>
<given-names>C</given-names>
</name>
(
<year>2011</year>
)
<article-title>A hybrid approach to detect and localize texts in natural scene images</article-title>
.
<source>IEEE Trans Image Processing</source>
<volume>20</volume>
:
<fpage>800</fpage>
<lpage>813</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Yi1">
<label>14</label>
<mixed-citation publication-type="journal">
<name>
<surname>Yi</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Tian</surname>
<given-names>Y</given-names>
</name>
(
<year>2011</year>
)
<article-title>Text string detection from natural scenes by structure-based partition and grouping</article-title>
.
<source>IEEE Trans Image Processing</source>
<volume>20</volume>
:
<fpage>2594</fpage>
<lpage>2605</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Coates1">
<label>15</label>
<mixed-citation publication-type="other">Coates A, Carpenter B, Case C, Satheesh S, Suresh B,
<etal>et al</etal>
. (2011) Text detection and character recognition in scene images with unsupervised feature learning. In: Proc. of ICDAR.</mixed-citation>
</ref>
<ref id="pone.0070173-Shivakumara1">
<label>16</label>
<mixed-citation publication-type="journal">
<name>
<surname>Shivakumara</surname>
<given-names>P</given-names>
</name>
,
<name>
<surname>Phan</surname>
<given-names>TQ</given-names>
</name>
,
<name>
<surname>Tan</surname>
<given-names>CL</given-names>
</name>
(
<year>2011</year>
)
<article-title>A laplacian approach to multi-oriented text detection in video</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>33</volume>
:
<fpage>412</fpage>
<lpage>419</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Bouman1">
<label>17</label>
<mixed-citation publication-type="journal">
<name>
<surname>Bouman</surname>
<given-names>KL</given-names>
</name>
,
<name>
<surname>Abdollahian</surname>
<given-names>G</given-names>
</name>
,
<name>
<surname>Boutin</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Delp</surname>
<given-names>EJ</given-names>
</name>
(
<year>2011</year>
)
<article-title>A low complexity sign detection and text localization method for mobile applications</article-title>
.
<source>IEEE Trans Multimedia</source>
<volume>13</volume>
:
<fpage>922</fpage>
<lpage>934</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Neumann2">
<label>18</label>
<mixed-citation publication-type="other">Neumann L, Matas J (2012) Real-time scene text localization and recognition. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Minetto1">
<label>19</label>
<mixed-citation publication-type="other">Minetto R, Thome N, Cord M, Leite NJ, Stolfi J (2013 (accepted)) T-hog: an effective gradient-based descriptor for single line text regions. Pattern Recognition .</mixed-citation>
</ref>
<ref id="pone.0070173-Wang2">
<label>20</label>
<mixed-citation publication-type="other">Wang K, Babenko B, Belongie S (2011) End-to-end scene text recognition. In: Proc. of ICCV.</mixed-citation>
</ref>
<ref id="pone.0070173-ABBYY1">
<label>21</label>
<mixed-citation publication-type="other">ABBYY (2012) ABBYY Mobile Products. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.abbyy.com/mobile/">http://www.abbyy.com/mobile/</ext-link>
Accessed 2013 Jun. 22.</mixed-citation>
</ref>
<ref id="pone.0070173-GVision1">
<label>22</label>
<mixed-citation publication-type="other">GVision (2012) 3GVision’s Business Card Reader Application. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.i-nigma.com/TextRecognition.html">http://www.i-nigma.com/TextRecognition.html</ext-link>
Accessed 2013 Jun. 22.</mixed-citation>
</ref>
<ref id="pone.0070173-Yao1">
<label>23</label>
<mixed-citation publication-type="other">Yao C, Bai X, Liu W, Ma Y, Tu Z (2012) Detecting texts of arbitrary orientations in natural images. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Gllavata1">
<label>24</label>
<mixed-citation publication-type="other">Gllavata J, Ewerth R, Freisleben B (2004) Text detection in images based on unsupervised classi-fication of high-frequency wavelet coefficients. In: Proc. of ICPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Chen3">
<label>25</label>
<mixed-citation publication-type="other">Chen H, Tsai S, Schroth G, Chen D, Grzeszczuk R,
<etal>et al</etal>
. (2011) Robust text detection in natural images with edge-enhanced maximally stable extremal regions. In: Proc. of ICIP.</mixed-citation>
</ref>
<ref id="pone.0070173-Zhao2">
<label>26</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zhao</surname>
<given-names>X</given-names>
</name>
,
<name>
<surname>Lin</surname>
<given-names>KH</given-names>
</name>
,
<name>
<surname>Fu</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Hu</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
,
<etal>et al</etal>
(
<year>2011</year>
)
<article-title>Text from corners: A novel approach to detect text and caption in videos</article-title>
.
<source>IEEE Trans Image Processing</source>
<volume>20</volume>
:
<fpage>790</fpage>
<lpage>799</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Yi2">
<label>27</label>
<mixed-citation publication-type="journal">
<name>
<surname>Yi</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Tian</surname>
<given-names>Y</given-names>
</name>
(
<year>2011</year>
)
<article-title>Localizing text in scene images by boundary clustering, stroke segmentation, and string fragment classification</article-title>
.
<source>IEEE Trans Image Processing</source>
<volume>21</volume>
:
<fpage>4256</fpage>
<lpage>4268</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Liu1">
<label>28</label>
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>X</given-names>
</name>
,
<name>
<surname>Wang</surname>
<given-names>W</given-names>
</name>
(
<year>2012</year>
)
<article-title>Robustly extracting captions in videos based on stroke-like edges and spatio-temporal analysis</article-title>
.
<source>IEEE Trans Multimedia</source>
<volume>14</volume>
:
<fpage>482</fpage>
<lpage>489</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Jung1">
<label>29</label>
<mixed-citation publication-type="journal">
<name>
<surname>Jung</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Kim</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Jain</surname>
<given-names>A</given-names>
</name>
(
<year>2004</year>
)
<article-title>Text information extraction in images and video: a survey</article-title>
.
<source>Pattern Recognition</source>
<volume>37</volume>
:
<fpage>977</fpage>
<lpage>997</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Liang1">
<label>30</label>
<mixed-citation publication-type="journal">
<name>
<surname>Liang</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Doermann</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Li</surname>
<given-names>H</given-names>
</name>
(
<year>2005</year>
)
<article-title>Camera-based analysis of text and documents: a survey</article-title>
.
<source>IJDAR</source>
<volume>7</volume>
:
<fpage>84</fpage>
<lpage>104</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Zhong1">
<label>31</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zhong</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Karu</surname>
<given-names>K</given-names>
</name>
,
<name>
<surname>Jain</surname>
<given-names>AK</given-names>
</name>
(
<year>1995</year>
)
<article-title>Locating text in complex color images</article-title>
.
<source>Pattern Recognition</source>
<volume>28</volume>
:
<fpage>1523</fpage>
<lpage>1535</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Wu1">
<label>32</label>
<mixed-citation publication-type="other">Wu V, Manmatha R, Riseman EM (1997) Finding text in images. In: Proc. of 2nd ACM Int. Conf. Digital Libraries.</mixed-citation>
</ref>
<ref id="pone.0070173-Li1">
<label>33</label>
<mixed-citation publication-type="journal">
<name>
<surname>Li</surname>
<given-names>HP</given-names>
</name>
,
<name>
<surname>Doermann</surname>
<given-names>D</given-names>
</name>
,
<name>
<surname>Kia</surname>
<given-names>O</given-names>
</name>
(
<year>2000</year>
)
<article-title>Automatic text detection and tracking in digital video</article-title>
.
<source>IEEE Trans Image Processing</source>
<volume>9</volume>
:
<fpage>147</fpage>
<lpage>156</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Zhong2">
<label>34</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zhong</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Zhang</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Jain</surname>
<given-names>AK</given-names>
</name>
(
<year>2000</year>
)
<article-title>Automatic caption localization in compressed video</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>22</volume>
:
<fpage>385</fpage>
<lpage>392</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Lienhart1">
<label>35</label>
<mixed-citation publication-type="journal">
<name>
<surname>Lienhart</surname>
<given-names>R</given-names>
</name>
,
<name>
<surname>Wernicke</surname>
<given-names>A</given-names>
</name>
(
<year>2002</year>
)
<article-title>Localizing and segmenting text in images and videos</article-title>
.
<source>IEEE Trans CSVT</source>
<volume>12</volume>
:
<fpage>256</fpage>
<lpage>268</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Weinman1">
<label>36</label>
<mixed-citation publication-type="other">Weinman J, Hanson A, McCallum A (2004) Sign detection in natural images with conditional random fields. In: Proc. of WMLSP.</mixed-citation>
</ref>
<ref id="pone.0070173-Lyu1">
<label>37</label>
<mixed-citation publication-type="journal">
<name>
<surname>Lyu</surname>
<given-names>MR</given-names>
</name>
,
<name>
<surname>Song</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Cai</surname>
<given-names>M</given-names>
</name>
(
<year>2005</year>
)
<article-title>A comprehensive method for multilingual video text detection, localization, and extraction</article-title>
.
<source>IEEE Trans CSVT</source>
<volume>15</volume>
:
<fpage>243</fpage>
<lpage>255</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Ikica1">
<label>38</label>
<mixed-citation publication-type="other">Ikica A, Peer P (2011) An improved edge profile based method for text detection in images of natural scenes. In: Proc. of EUROCON.</mixed-citation>
</ref>
<ref id="pone.0070173-Minetto2">
<label>39</label>
<mixed-citation publication-type="other">Minetto R, Thome N, Cord M, Fabrizio J, Marcotegui B (2010) Snoopertext: A multiresolution system for text detection in complex visual scenes. In: Proc. of ICIP.</mixed-citation>
</ref>
<ref id="pone.0070173-Neumann3">
<label>40</label>
<mixed-citation publication-type="other">Neumann L, Matas J (2011) Text localization in real-world images using efficiently pruned exhaus-tive search. In: Proc. of ICDAR.</mixed-citation>
</ref>
<ref id="pone.0070173-Wright1">
<label>41</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wright</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Yang</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Ganesh</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Sastry</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Ma</surname>
<given-names>Y</given-names>
</name>
(
<year>2009</year>
)
<article-title>Robust face recognition via sparse representation</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>31</volume>
:
<fpage>210</fpage>
<lpage>227</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Elad1">
<label>42</label>
<mixed-citation publication-type="journal">
<name>
<surname>Elad</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Aharon</surname>
<given-names>M</given-names>
</name>
(
<year>2006</year>
)
<article-title>Image denoising via sparse and redundant representations over learned dictionaries</article-title>
.
<source>IEEE Trans Image Processing</source>
<volume>15</volume>
:
<fpage>3736</fpage>
<lpage>3745</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Pan2">
<label>43</label>
<mixed-citation publication-type="other">Pan W, Bui TD, Suen CY (2009) Text detection from natural scene images using topographic maps and sparse representations. In: Proc. of ICIP.</mixed-citation>
</ref>
<ref id="pone.0070173-Matas1">
<label>44</label>
<mixed-citation publication-type="other">Matas J, Chum O, MUrban, Pajdla T (2002) Robust wide baseline stereo from maximally stable extremal regions. In: Proc. of BMVC.</mixed-citation>
</ref>
<ref id="pone.0070173-Liu2">
<label>45</label>
<mixed-citation publication-type="journal">
<name>
<surname>Liu</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Goto</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Ikenaga</surname>
<given-names>T</given-names>
</name>
(
<year>2006</year>
)
<article-title>A contour-based robust algorithm for text detection in color images</article-title>
.
<source>IEICE Trans Inf Syst</source>
<volume>E89-D</volume>
:
<fpage>1221</fpage>
<lpage>1230</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Dalal1">
<label>46</label>
<mixed-citation publication-type="other">Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proc. Of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Lucas1">
<label>47</label>
<mixed-citation publication-type="other">Lucas SM, Panaretos A, Sosa L, Tang A, Wong S,
<etal>et al</etal>
. (2003) ICDAR 2003 robust reading competitions. In: Proc. of ICDAR.</mixed-citation>
</ref>
<ref id="pone.0070173-Lucas2">
<label>48</label>
<mixed-citation publication-type="other">Lucas SM (2005) ICDAR 2005 text locating competition results. In: Proc. of ICDAR.</mixed-citation>
</ref>
<ref id="pone.0070173-Shahab1">
<label>49</label>
<mixed-citation publication-type="other">Shahab A, Shafait F, Dengel A (2011) ICDAR 2011 robust reading competition challenge 2: Read-ing text in scene images. In: Proc. of ICDAR.</mixed-citation>
</ref>
<ref id="pone.0070173-Hua1">
<label>50</label>
<mixed-citation publication-type="journal">
<name>
<surname>Hua</surname>
<given-names>XS</given-names>
</name>
,
<name>
<surname>Liu</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Zhang</surname>
<given-names>HJ</given-names>
</name>
(
<year>2004</year>
)
<article-title>An automatic performance evaluation protocol for video text detection algorithms</article-title>
.
<source>IEEE Trans CSVT</source>
<volume>14</volume>
:
<fpage>498</fpage>
<lpage>507</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Canny1">
<label>51</label>
<mixed-citation publication-type="journal">
<name>
<surname>Canny</surname>
<given-names>JF</given-names>
</name>
(
<year>1986</year>
)
<article-title>A computational approach to edge detection</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>8</volume>
:
<fpage>679</fpage>
<lpage>698</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Bradski1">
<label>52</label>
<mixed-citation publication-type="other">Bradski GR (1998) Real time face and object tracking as a component of a perceptual user interface. In: Proc. of IEEE Workshop on Applications of Computer Vision.</mixed-citation>
</ref>
<ref id="pone.0070173-Breiman1">
<label>53</label>
<mixed-citation publication-type="journal">
<name>
<surname>Breiman</surname>
<given-names>L</given-names>
</name>
(
<year>2001</year>
)
<article-title>Random forests</article-title>
.
<source>Machine Learning</source>
<volume>45</volume>
:
<fpage>5</fpage>
<lpage>32</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Hastie1">
<label>54</label>
<mixed-citation publication-type="other">Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. New York: Springer.</mixed-citation>
</ref>
<ref id="pone.0070173-Rijsbergen1">
<label>55</label>
<mixed-citation publication-type="other">Rijsbergen CV (1979) Information Retrieval, Second Edition. London: Butterworths.</mixed-citation>
</ref>
<ref id="pone.0070173-Freeman1">
<label>56</label>
<mixed-citation publication-type="journal">
<name>
<surname>Freeman</surname>
<given-names>H</given-names>
</name>
,
<name>
<surname>Shapira</surname>
<given-names>R</given-names>
</name>
(
<year>1975</year>
)
<article-title>Determining the minimum-area encasing rectangle for an arbitrary closed curve</article-title>
.
<source>Comm ACM</source>
<volume>18</volume>
:
<fpage>409</fpage>
<lpage>413</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Belongie1">
<label>57</label>
<mixed-citation publication-type="journal">
<name>
<surname>Belongie</surname>
<given-names>S</given-names>
</name>
,
<name>
<surname>Malik</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Puzicha</surname>
<given-names>J</given-names>
</name>
(
<year>2002</year>
)
<article-title>Shape matching and object recognition using shape contexts</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>24</volume>
:
<fpage>509</fpage>
<lpage>522</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Wang3">
<label>58</label>
<mixed-citation publication-type="other">Wang X, Bai X, Liu W, Latecki LJ (2011) Feature context for image classification and object detection. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Gu1">
<label>59</label>
<mixed-citation publication-type="other">Gu C, Lim J, Arbelaez P, Malik J (2009) Recognition using regions. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Bosch1">
<label>60</label>
<mixed-citation publication-type="other">Bosch A, Zisserman A, Munoz X (2007) Representing shape with a spatial pyramid kernel. In: Proc. of CIVR.</mixed-citation>
</ref>
<ref id="pone.0070173-Lazebnik1">
<label>61</label>
<mixed-citation publication-type="other">Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Peng1">
<label>62</label>
<mixed-citation publication-type="journal">
<name>
<surname>Peng</surname>
<given-names>Y</given-names>
</name>
,
<name>
<surname>Ganesh</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Wright</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Xu</surname>
<given-names>W</given-names>
</name>
,
<name>
<surname>Ma</surname>
<given-names>Y</given-names>
</name>
(
<year>2012</year>
)
<article-title>RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images</article-title>
.
<source>IEEE Trans PAMI</source>
<volume>34</volume>
:
<fpage>2233</fpage>
<lpage>2246</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Ruiz1">
<label>63</label>
<mixed-citation publication-type="other">Ruiz A (2002) Affine alignment for stroke classification. In: 8th International Workshop on Fron-tiers in Handwriting Recognition.</mixed-citation>
</ref>
<ref id="pone.0070173-Shechtman1">
<label>64</label>
<mixed-citation publication-type="other">Shechtman E, Irani M (2007) Matching local self-similarities across images and videos. In: Proc. of CVPR.</mixed-citation>
</ref>
<ref id="pone.0070173-Everingham1">
<label>65</label>
<mixed-citation publication-type="journal">
<name>
<surname>Everingham</surname>
<given-names>M</given-names>
</name>
,
<name>
<surname>Gool</surname>
<given-names>LV</given-names>
</name>
,
<name>
<surname>Williams</surname>
<given-names>CKI</given-names>
</name>
,
<name>
<surname>Winn</surname>
<given-names>J</given-names>
</name>
,
<name>
<surname>Zisserman</surname>
<given-names>A</given-names>
</name>
(
<year>2010</year>
)
<article-title>The PASCAL Visual Object Classes (VOC) challenge</article-title>
.
<source>IJCV</source>
<volume>88</volume>
:
<fpage>303</fpage>
<lpage>338</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Wolf1">
<label>66</label>
<mixed-citation publication-type="journal">
<name>
<surname>Wolf</surname>
<given-names>C</given-names>
</name>
,
<name>
<surname>Jolion</surname>
<given-names>JM</given-names>
</name>
(
<year>2006</year>
)
<article-title>Object count/area graphs for the evaluation of object detection and segmentation algorithms</article-title>
.
<source>IJDAR</source>
<volume>8</volume>
:
<fpage>280</fpage>
<lpage>296</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Zhang1">
<label>67</label>
<mixed-citation publication-type="journal">
<name>
<surname>Zhang</surname>
<given-names>Z</given-names>
</name>
,
<name>
<surname>Ganesh</surname>
<given-names>A</given-names>
</name>
,
<name>
<surname>Liang</surname>
<given-names>X</given-names>
</name>
,
<name>
<surname>Ma</surname>
<given-names>Y</given-names>
</name>
(
<year>2012</year>
)
<article-title>TILT: Transform invariant low-rank textures</article-title>
.
<source>IJCV</source>
<volume>99</volume>
:
<fpage>1</fpage>
<lpage>24</lpage>
</mixed-citation>
</ref>
<ref id="pone.0070173-Wintone1">
<label>68</label>
<mixed-citation publication-type="other">Wintone (2012) TH-OCR. Available:
<ext-link ext-link-type="uri" xlink:href="http://www.wintone.com.cn/en/Products/detail118.aspx">http://www.wintone.com.cn/en/Products/detail118.aspx</ext-link>
Accessed 2013 Jun. 22.</mixed-citation>
</ref>
<ref id="pone.0070173-Nagy1">
<label>69</label>
<mixed-citation publication-type="other">Nagy R, Dicker A, Meyer-Wegener K (2011) NEOCR: A configurable dataset for natural image text recognition. In: CBDAR Workshop at ICDAR.</mixed-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Pmc/Curation
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000175 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd -nk 000175 | SxmlIndent | more

To add a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Pmc
   |étape=   Curation
   |type=    RBID
   |clé=     PMC:3734103
   |texte=   Rotation-Invariant Features for Multi-Oriented Text Detection in Natural Images
}}

To generate wiki pages

HfdIndexSelect -h $EXPLOR_AREA/Data/Pmc/Curation/RBID.i   -Sk "pubmed:23940544" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/Pmc/Curation/biblio.hfd   \
       | NlmPubMed2Wicri -a OcrV1 

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024