OCR exploration server

Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context

Internal identifier: 000685 (PascalFrancis/Corpus); previous: 000684; next: 000686

Authors: A. Kacem; A. Belaïd; M. Ben Ahmed

Source :

RBID : Pascal:02-0104694

Abstract

This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.
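To make the idea of fuzzy primary labeling concrete, here is a toy sketch. It is not the authors' implementation: the abstract gives no features, membership functions or thresholds, so the feature names (`aspect_ratio`, `relative_height`), the triangular memberships, the two symbol models and the 0.5 threshold are all assumptions chosen for illustration.

```python
# Toy illustration of fuzzy labeling of a connected component.
# Every feature, model and number below is hypothetical, not taken from the paper.

def triangular(x, low, peak, high):
    """Triangular membership: 0 outside (low, high), 1 at peak."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)

# Hypothetical models, as if obtained at a learning step.
MODELS = {
    "fraction_bar": {
        "aspect_ratio": (4.0, 10.0, 30.0),     # width / height
        "relative_height": (0.0, 0.1, 0.3),    # height / text-line height
    },
    "summation": {
        "aspect_ratio": (0.5, 0.9, 1.5),
        "relative_height": (1.0, 1.6, 3.0),
    },
}

def label(features, threshold=0.5):
    """Return (label, degree); falls back to 'text' below the threshold."""
    scores = {
        name: min(triangular(features[f], *params) for f, params in model.items())
        for name, model in MODELS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]
    return "text", scores[best]

print(label({"aspect_ratio": 10.0, "relative_height": 0.1}))
print(label({"aspect_ratio": 1.0, "relative_height": 0.5}))
```

Combining memberships with `min` is one common fuzzy AND; a secondary pass that revisits low-confidence components using their neighbours would play the role the abstract assigns to secondary labeling.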

Record in standard format (ISO 2709)

See the documentation on the Inist Standard format.

pA  
A01 01  1    @0 1433-2833
A03   1    @0 Int. j. doc. anal. recognit. : (Print)
A05       @2 4
A06       @2 2
A08 01  1  ENG  @1 Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context
A11 01  1    @1 KACEM (A.)
A11 02  1    @1 BELAÏD (A.)
A11 03  1    @1 BEN AHMED (M.)
A14 01      @1 ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali @2 2040 Radès @3 TUN @Z 1 aut.
A14 02      @1 LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239 @2 54506 Vandoeuvre Nancy @3 FRA @Z 2 aut.
A14 03      @1 RIADI-Tunisia @3 TUN @Z 3 aut.
A20       @1 97-108
A21       @1 2001
A23 01      @0 ENG
A43 01      @1 INIST @2 26790 @5 354000099533410030
A44       @0 0000 @1 © 2002 INIST-CNRS. All rights reserved.
A45       @0 20 ref.
A47 01  1    @0 02-0104694
A60       @1 P
A61       @0 A
A64 01  1    @0 International journal on document analysis and recognition : (Print)
A66 01      @0 DEU
C01 01    ENG  @0 This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.
C02 01  X    @0 001D02C03
C03 01  X  FRE  @0 Méthode heuristique @5 01
C03 01  X  ENG  @0 Heuristic method @5 01
C03 01  X  SPA  @0 Método heurístico @5 01
C03 02  X  FRE  @0 Segmentation @5 02
C03 02  X  ENG  @0 Segmentation @5 02
C03 02  X  SPA  @0 Segmentación @5 02
C03 03  X  FRE  @0 Caractère imprimé @5 03
C03 03  X  ENG  @0 Printed character @5 03
C03 03  X  SPA  @0 Carácter impreso @5 03
C03 04  X  FRE  @0 Reconnaissance caractère @5 04
C03 04  X  ENG  @0 Character recognition @5 04
C03 04  X  SPA  @0 Reconocimiento carácter @5 04
C03 05  X  FRE  @0 Logique floue @5 05
C03 05  X  ENG  @0 Fuzzy logic @5 05
C03 05  X  SPA  @0 Lógica difusa @5 05
C03 06  X  FRE  @0 Logique mathématique @5 06
C03 06  X  ENG  @0 Mathematical logic @5 06
C03 06  X  SPA  @0 Lógica matemática @5 06
C03 07  X  FRE  @0 Formule mathématique @5 07
C03 07  X  ENG  @0 Mathematical formula @5 07
C03 07  X  SPA  @0 Fórmula matemática @5 07
N21       @1 056
N82       @1 PSI
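The field lines above follow a regular pattern: a field tag, optional qualifiers (occurrence number, indicator, language code), then subfields introduced by `@` and a one-character code. The sketch below parses one such line under that reading of the listing. The function name `parse_line` and the assumption that no subfield value itself contains `@x ` are mine, not part of the Inist documentation.

```python
import re

# A subfield is "@" + one code character, then text up to the next subfield or end of line.
SUBFIELD = re.compile(r"@(\w)\s+(.*?)(?=\s+@\w\s|$)")

def parse_line(line):
    """Split a notice line into (tag, qualifiers, [(code, value), ...]).

    Returns None for lines without any subfield (for example the "pA" header).
    """
    head, sep, rest = line.strip().partition("@")
    tokens = head.split()
    if not sep or not tokens:
        return None
    tag, qualifiers = tokens[0], tokens[1:]
    return tag, qualifiers, SUBFIELD.findall("@" + rest)

sample = """\
A11 01  1    @1 KACEM (A.)
A14 01      @1 ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali @2 2040 Radès @3 TUN @Z 1 aut.
A20       @1 97-108
"""

for raw in sample.splitlines():
    print(parse_line(raw))
```

Keeping the subfields as an ordered list of pairs, rather than a dictionary, preserves their order and tolerates a code that appears more than once in a field.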

Inist format (server)

NO : PASCAL 02-0104694 INIST
ET : Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context
AU : KACEM (A.); BELAÏD (A.); BEN AHMED (M.)
AF : ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali/2040 Radès/Tunisie (1 aut.); LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239/54506 Vandoeuvre Nancy/France (2 aut.); RIADI-Tunisia/Tunisie (3 aut.)
DT : Publication en série; Niveau analytique
SO : International journal on document analysis and recognition : (Print); ISSN 1433-2833; Allemagne; Da. 2001; Vol. 4; No. 2; Pp. 97-108; Bibl. 20 ref.
LA : Anglais
EA : This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.
CC : 001D02C03
FD : Méthode heuristique; Segmentation; Caractère imprimé; Reconnaissance caractère; Logique floue; Logique mathématique; Formule mathématique
ED : Heuristic method; Segmentation; Printed character; Character recognition; Fuzzy logic; Mathematical logic; Mathematical formula
SD : Método heurístico; Segmentación; Carácter impreso; Reconocimiento carácter; Lógica difusa; Lógica matemática; Fórmula matemática
LO : INIST-26790.354000099533410030
ID : 02-0104694
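The server format above is a flat list of `TAG : value` lines, with `;` separating repeated values in fields such as AU, FD, ED and SD. A minimal sketch of reading it into a dictionary follows; the helper names `parse_server_record` and `split_values` are hypothetical, and the rule "a tag is two uppercase letters followed by ` : `" is inferred from this listing only.

```python
def parse_server_record(text):
    """Map each two-letter tag to its raw value string."""
    record = {}
    for line in text.splitlines():
        # Split on the first " : " only, so values such as
        # "... recognition : (Print)" are kept whole.
        tag, sep, value = line.partition(" : ")
        if sep and len(tag) == 2 and tag.isupper():
            record[tag] = value.strip()
    return record

def split_values(value):
    """Split a semicolon-separated field into a list of trimmed items."""
    return [item.strip() for item in value.split(";") if item.strip()]

sample = """\
AU : KACEM (A.); BELAÏD (A.); BEN AHMED (M.)
SO : International journal on document analysis and recognition : (Print); ISSN 1433-2833
LA : Anglais
ED : Heuristic method; Segmentation; Fuzzy logic
"""

record = parse_server_record(sample)
print(split_values(record["AU"]))
print(record["LA"])
```

Splitting on `;` is deliberately left to the caller, because free-text fields such as the abstract (EA) may contain semicolons that are not separators.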

Links to Exploration step

Pascal:02-0104694

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</title>
<author>
<name sortKey="Kacem, A" sort="Kacem, A" uniqKey="Kacem A" first="A." last="Kacem">A. Kacem</name>
<affiliation>
<inist:fA14 i1="01">
<s1>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali</s1>
<s2>2040 Radès</s2>
<s3>TUN</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation>
<inist:fA14 i1="02">
<s1>LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239</s1>
<s2>54506 Vandoeuvre Nancy</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Ben Ahmed, M" sort="Ben Ahmed, M" uniqKey="Ben Ahmed M" first="M." last="Ben Ahmed">M. Ben Ahmed</name>
<affiliation>
<inist:fA14 i1="03">
<s1>RIADI-Tunisia</s1>
<s3>TUN</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">02-0104694</idno>
<date when="2001">2001</date>
<idno type="stanalyst">PASCAL 02-0104694 INIST</idno>
<idno type="RBID">Pascal:02-0104694</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000685</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</title>
<author>
<name sortKey="Kacem, A" sort="Kacem, A" uniqKey="Kacem A" first="A." last="Kacem">A. Kacem</name>
<affiliation>
<inist:fA14 i1="01">
<s1>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali</s1>
<s2>2040 Radès</s2>
<s3>TUN</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Belaid, A" sort="Belaid, A" uniqKey="Belaid A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation>
<inist:fA14 i1="02">
<s1>LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239</s1>
<s2>54506 Vandoeuvre Nancy</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
<author>
<name sortKey="Ben Ahmed, M" sort="Ben Ahmed, M" uniqKey="Ben Ahmed M" first="M." last="Ben Ahmed">M. Ben Ahmed</name>
<affiliation>
<inist:fA14 i1="03">
<s1>RIADI-Tunisia</s1>
<s3>TUN</s3>
<sZ>3 aut.</sZ>
</inist:fA14>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date when="2001">2001</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Fuzzy logic</term>
<term>Heuristic method</term>
<term>Mathematical formula</term>
<term>Mathematical logic</term>
<term>Printed character</term>
<term>Segmentation</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Méthode heuristique</term>
<term>Segmentation</term>
<term>Caractère imprimé</term>
<term>Reconnaissance caractère</term>
<term>Logique floue</term>
<term>Logique mathématique</term>
<term>Formule mathématique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.</div>
</front>
</TEI>
<inist>
<standard h6="B">
<pA>
<fA01 i1="01" i2="1">
<s0>1433-2833</s0>
</fA01>
<fA03 i2="1">
<s0>Int. j. doc. anal. recognit. : (Print)</s0>
</fA03>
<fA05>
<s2>4</s2>
</fA05>
<fA06>
<s2>2</s2>
</fA06>
<fA08 i1="01" i2="1" l="ENG">
<s1>Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</s1>
</fA08>
<fA11 i1="01" i2="1">
<s1>KACEM (A.)</s1>
</fA11>
<fA11 i1="02" i2="1">
<s1>BELAÏD (A.)</s1>
</fA11>
<fA11 i1="03" i2="1">
<s1>BEN AHMED (M.)</s1>
</fA11>
<fA14 i1="01">
<s1>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali</s1>
<s2>2040 Radès</s2>
<s3>TUN</s3>
<sZ>1 aut.</sZ>
</fA14>
<fA14 i1="02">
<s1>LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239</s1>
<s2>54506 Vandoeuvre Nancy</s2>
<s3>FRA</s3>
<sZ>2 aut.</sZ>
</fA14>
<fA14 i1="03">
<s1>RIADI-Tunisia</s1>
<s3>TUN</s3>
<sZ>3 aut.</sZ>
</fA14>
<fA20>
<s1>97-108</s1>
</fA20>
<fA21>
<s1>2001</s1>
</fA21>
<fA23 i1="01">
<s0>ENG</s0>
</fA23>
<fA43 i1="01">
<s1>INIST</s1>
<s2>26790</s2>
<s5>354000099533410030</s5>
</fA43>
<fA44>
<s0>0000</s0>
<s1>© 2002 INIST-CNRS. All rights reserved.</s1>
</fA44>
<fA45>
<s0>20 ref.</s0>
</fA45>
<fA47 i1="01" i2="1">
<s0>02-0104694</s0>
</fA47>
<fA60>
<s1>P</s1>
</fA60>
<fA61>
<s0>A</s0>
</fA61>
<fA64 i1="01" i2="1">
<s0>International journal on document analysis and recognition : (Print)</s0>
</fA64>
<fA66 i1="01">
<s0>DEU</s0>
</fA66>
<fC01 i1="01" l="ENG">
<s0>This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.</s0>
</fC01>
<fC02 i1="01" i2="X">
<s0>001D02C03</s0>
</fC02>
<fC03 i1="01" i2="X" l="FRE">
<s0>Méthode heuristique</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="ENG">
<s0>Heuristic method</s0>
<s5>01</s5>
</fC03>
<fC03 i1="01" i2="X" l="SPA">
<s0>Método heurístico</s0>
<s5>01</s5>
</fC03>
<fC03 i1="02" i2="X" l="FRE">
<s0>Segmentation</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="ENG">
<s0>Segmentation</s0>
<s5>02</s5>
</fC03>
<fC03 i1="02" i2="X" l="SPA">
<s0>Segmentación</s0>
<s5>02</s5>
</fC03>
<fC03 i1="03" i2="X" l="FRE">
<s0>Caractère imprimé</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="ENG">
<s0>Printed character</s0>
<s5>03</s5>
</fC03>
<fC03 i1="03" i2="X" l="SPA">
<s0>Carácter impreso</s0>
<s5>03</s5>
</fC03>
<fC03 i1="04" i2="X" l="FRE">
<s0>Reconnaissance caractère</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="ENG">
<s0>Character recognition</s0>
<s5>04</s5>
</fC03>
<fC03 i1="04" i2="X" l="SPA">
<s0>Reconocimiento carácter</s0>
<s5>04</s5>
</fC03>
<fC03 i1="05" i2="X" l="FRE">
<s0>Logique floue</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="ENG">
<s0>Fuzzy logic</s0>
<s5>05</s5>
</fC03>
<fC03 i1="05" i2="X" l="SPA">
<s0>Lógica difusa</s0>
<s5>05</s5>
</fC03>
<fC03 i1="06" i2="X" l="FRE">
<s0>Logique mathématique</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="ENG">
<s0>Mathematical logic</s0>
<s5>06</s5>
</fC03>
<fC03 i1="06" i2="X" l="SPA">
<s0>Lógica matemática</s0>
<s5>06</s5>
</fC03>
<fC03 i1="07" i2="X" l="FRE">
<s0>Formule mathématique</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="ENG">
<s0>Mathematical formula</s0>
<s5>07</s5>
</fC03>
<fC03 i1="07" i2="X" l="SPA">
<s0>Fórmula matemática</s0>
<s5>07</s5>
</fC03>
<fN21>
<s1>056</s1>
</fN21>
<fN82>
<s1>PSI</s1>
</fN82>
</pA>
</standard>
<server>
<NO>PASCAL 02-0104694 INIST</NO>
<ET>Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</ET>
<AU>KACEM (A.); BELAÏD (A.); BEN AHMED (M.)</AU>
<AF>ENSI-RIADI, 77 rue de Carthage, Cité Mohamed Ali/2040 Radès/Tunisie (1 aut.); LORIA-CNRS, Bâtiment LORIA, campus scientifique BP 239/54506 Vandoeuvre Nancy/France (2 aut.); RIADI-Tunisia/Tunisie (3 aut.)</AF>
<DT>Publication en série; Niveau analytique</DT>
<SO>International journal on document analysis and recognition : (Print); ISSN 1433-2833; Allemagne; Da. 2001; Vol. 4; No. 2; Pp. 97-108; Bibl. 20 ref.</SO>
<LA>Anglais</LA>
<EA>This paper describes a new method to segment printed mathematical documents precisely and extract formulas automatically from their images. Unlike classical methods, it is more directed towards segmentation rather than recognition, isolating mathematical formulas outside and inside text-lines. Our ultimate goal is to delimit parts of text that could disturb OCR applications, not yet trained for formula recognition and restructuring. The method is based on a global and a local segmentation. The global segmentation separates isolated formulas from the text lines using a primary labeling. The local segmentation propagates the context around the mathematical operators met to discard embedded formulas from plain text. The primary labeling identifies some mathematical symbols by models created at a learning step using fuzzy logic. The secondary labeling reinforces the results of the primary labeling and locates the subscripts and the superscripts inside the text. A heuristic has been defined that guides this automatic process. In this paper, the different modules making up the automated segmentation of mathematical document system are presented with examples of results. Experiments carried out on some commonly seen mathematical documents show that our proposed method can achieve quite satisfactory rates, making mathematical formula extraction more feasible for real-world applications. The average rate of primary labeling of mathematical operators is about 95.3% and their secondary labeling can improve the rate by about 4%. The formula extraction rate, evaluated with 300 formulas and 100 mathematical documents having variable complexity, is close to 93%.</EA>
<CC>001D02C03</CC>
<FD>Méthode heuristique; Segmentation; Caractère imprimé; Reconnaissance caractère; Logique floue; Logique mathématique; Formule mathématique</FD>
<ED>Heuristic method; Segmentation; Printed character; Character recognition; Fuzzy logic; Mathematical logic; Mathematical formula</ED>
<SD>Método heurístico; Segmentación; Carácter impreso; Reconocimiento carácter; Lógica difusa; Lógica matemática; Fórmula matemática</SD>
<LO>INIST-26790.354000099533410030</LO>
<ID>02-0104694</ID>
</server>
</inist>
</record>
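The record as listed uses the `inist:` prefix without declaring it, so a namespace-aware parser such as Python's `xml.etree.ElementTree` would reject it with an "unbound prefix" error. One possible workaround, sketched below, wraps the text in an element that binds the prefix to a placeholder URI. The URI `urn:example:inist`, the `wrapper` element and the function names are assumptions for illustration only, and the fragment is a trimmed copy of the record above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the record does not declare one for "inist".
PLACEHOLDER_NS = "urn:example:inist"

def load_record(xml_text):
    """Parse a record fragment after binding the undeclared inist: prefix.

    Assumes xml_text has no XML declaration of its own.
    """
    wrapped = f'<wrapper xmlns:inist="{PLACEHOLDER_NS}">{xml_text}</wrapper>'
    return ET.fromstring(wrapped)

def title_and_authors(root):
    """Return the article title and the author names from titleStmt."""
    title = root.findtext(".//titleStmt/title")
    authors = [name.text for name in root.findall(".//titleStmt/author/name")]
    return title, authors

FRAGMENT = """
<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context</title>
<author>
<name sortKey="Kacem, A" first="A." last="Kacem">A. Kacem</name>
<affiliation><inist:fA14 i1="01"><s3>TUN</s3></inist:fA14></affiliation>
</author>
<author>
<name sortKey="Belaid, A" first="A." last="Belaïd">A. Belaïd</name>
<affiliation><inist:fA14 i1="02"><s3>FRA</s3></inist:fA14></affiliation>
</author>
</titleStmt></fileDesc></teiHeader></TEI></record>
"""

root = load_record(FRAGMENT)
print(title_and_authors(root))
# Elements carrying the prefix are addressed through the placeholder URI.
print([s3.text for s3 in root.findall(f".//{{{PLACEHOLDER_NS}}}fA14/s3")])
```

Restricting the search to `titleStmt` avoids counting each author twice, since the full record repeats the author list under `sourceDesc`.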

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/PascalFrancis/Corpus
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000685 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/PascalFrancis/Corpus/biblio.hfd -nk 000685 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    PascalFrancis
   |étape=   Corpus
   |type=    RBID
   |clé=     Pascal:02-0104694
   |texte=   Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024